<img src="../style/img/vs265header.svg"/>

<h1 align="center">Lab 4 - Sparse, Distributed Representations <font color="red"> [SOLUTIONS] </font> </h1>

## Part 2 - Sparse Coding of Natural Images

In [1]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import utils.plotFunctions as pf
import utils.helperFunctions as hf

In the first part of Lab 4 we generated our own data, so we knew exactly what the underlying generators of the data were. Here, we are going to attempt to learn the generators of a richer input ensemble: images of the natural world. We can't do this with the Foldiak sparse coding model because it only works with binary signals. Instead, we are going to learn a sparse coding dictionary as originally described in Olshausen & Field's 1996 and 1997 papers. The algorithm we are going to use to compute sparse codes is Rozell's Locally Competitive Algorithm (LCA), as described in his 2008 paper. LCA is explained in detail in a course handout; read that before going forward!

The training data are obtained by extracting small image patches from whitened natural scenes, which one can think of as an idealization of the input provided by the LGN, as we discussed in class when learning about whitening transforms.

Run the algorithm using 64 output neurons on 8x8 pixel image patches. To do this you must:

* Fill in the LCA equations in the `lcaSparsify` function.

* Fill in the $\phi$ update learning rule in the `updatePhi` function.
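
For reference, the two pieces to fill in can be summarized as follows. This is a sketch read off the solution code later in this notebook rather than a replacement for the handout. Here $x$ is an image patch, $u$ the membrane potentials, $a$ the thresholded activities, and $T_{\lambda}$ stands for whatever thresholding `hf.lcaThreshold` applies (its exact form is not shown in this notebook). The batch size $B$ and the patch index $k$ are notation introduced here for illustration.

$$
\begin{aligned}
b &= \Phi^{\top} x, \qquad G = \Phi^{\top}\Phi - I, \qquad a = T_{\lambda}(u), \\
\tau\,\dot{u} &= b - G\,a - u, \\
u &\leftarrow u + \tfrac{1}{\tau}\,\bigl(b - G\,a - u\bigr) \quad \text{(one Euler step)}, \\
\Delta\Phi &= \frac{\eta}{B} \sum_{k=1}^{B} \bigl(x_k - \Phi a_k\bigr)\, a_k^{\top}.
\end{aligned}
$$

After each weight update, the columns of $\Phi$ are renormalized to unit length.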

Verify that you can reconstruct an image from the set of learned features and comment on what was learned as well as the parameters used.

The code cell below sets the parameters for the sparse coding model. `numInputs` has been set to 64 to learn 8x8 pixel basis functions. This is not a hard constraint, so feel free to try out different patch sizes. You should also explore the effects of changing the `sparsityTradeoff` (i.e. $\lambda$) parameter. The LCA model has two key additional parameters: how long to perform inference is set by `numSteps`, and the membrane integration time constant is set by the variable `tau` (i.e. $\tau$). A lower value for $\tau$ causes the LCA model to compute a coarser estimate of the true dynamics and therefore come to an approximate solution in fewer steps. Finally, the `numOutputs` parameter establishes the overcompleteness of the model. What is the effect of changing the `numOutputs` and `sparsityTradeoff` parameters?

<b>Notes:</b> The LCA model takes longer than Foldiak's model to run. Make sure you always set `numInputs` to a perfect square (a value whose square root is an integer), so that each patch is square. Finally, `numOutputs` should be assigned to a multiple of `numInputs`.
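
A quick way to check a candidate patch size before running the model is sketched below. The helper name `isPerfectSquare` is hypothetical and not part of the course utilities; it uses integer arithmetic to avoid floating-point rounding in the square root.

```python
import math

def isPerfectSquare(n):
    """Return True if n is a positive integer whose square root is an integer."""
    if n <= 0:
        return False
    root = math.isqrt(n)  # exact integer square root (Python 3.8+)
    return root * root == n

# Hypothetical usage: list every valid numInputs up to a 16x16 patch
validNumInputs = [n for n in range(1, 257) if isPerfectSquare(n)]
```

Any value in `validNumInputs` (for example 64, 100, or 144) should satisfy the constraint on `numInputs` described in the notes above.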

In [23]:
# General sparse coding parameters
numTrials = 1200 # Number of weight learning steps
numInputs = 64 # Number of input pixels, needs to be a perfect square (e.g. 64 = 8x8)
numOutputs = 64 # Number of sparse coding neurons
sparsityTradeoff = 0.1 # Lambda parameter that determines how sparse the model will be
batchSize = 500 # How many image patches to include in batch
eta = 0.08 # Learning rate

# LCA specific parameters
tau = 50 # LCA update time constant
numSteps = 20 # Number of iterations to run LCA

# Plot display parameters
displayInterval = 50 # How often to update display plots during learning

In [24]:
assert numInputs%np.sqrt(numInputs) == 0, (
    "numInputs must be a perfect square (integer square root).")

# Load images and view them
dataset = np.load("data/IMAGES.npz")['arr_0']
[pixelsCols, pixelsRows, numImages] = dataset.shape
numPixels = pixelsCols * pixelsRows
dataset = dataset.reshape(numPixels, numImages)
dataset /= np.sqrt(np.var(dataset)) # We want the dataset to have variance=1

# Note: Here you can index any image, or just delete the [:,9] part and plot all images
pf.plotDataTiled(dataset[:,9], "Example Image Dataset");


In [11]:
def lcaSparsify(data, phi, tau, sparsityTradeoff, numSteps):
    """
    Compute sparse code of input data using the LCA

    Parameters
    ----------
    data : np.ndarray of dimensions (numInputs, batchSize) holding a batch of image patches
    phi : np.ndarray of dimensions (numInputs, numOutputs) holding sparse coding dictionary
    tau : float for setting time constant for LCA differential equation
    sparsityTradeoff : float indicating Sparse Coding lambda value (also LCA neuron threshold)
    numSteps: int indicating number of inference steps for the LCA model
    
    Returns
    -------
    a : np.ndarray of dimensions (numOutputs, batchSize) holding thresholded potentials
    """
    b = phi.T @ data # Driving input
    gramian = phi.T @ phi - np.identity(int(phi.shape[1])) # Explaining away matrix
    u = np.zeros_like(b) # Initialize membrane potentials to 0
    for step in range(numSteps):
        a = hf.lcaThreshold(u, sparsityTradeoff) # Activity vector contains thresholded membrane potentials
        du = b - gramian @ a - u # LCA dynamics define membrane update
        u += (1.0/tau) * du # Update membrane potentials using time constant
    return hf.lcaThreshold(u, sparsityTradeoff)

def lcaLearn(phi, patchBatch, sparseCode, learningRate):
    patchBatchRecon = phi @ sparseCode # Reconstruct input using the inferred sparse code
    reconError = patchBatch - patchBatchRecon # Error between the input and reconstruction
    dPhi = reconError @ sparseCode.T # Weight update direction (-dE/dPhi), summed over the batch
    phi = phi + learningRate * dPhi # Scale weight update by learning rate
    return (phi, reconError)
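
The threshold function `hf.lcaThreshold` is called above but not shown in this notebook. A common choice for LCA is two-sided soft thresholding; the sketch below assumes that form. The name `softThreshold` is hypothetical, and the course helper may differ (for example, it could be one-sided or a hard threshold).

```python
import numpy as np

def softThreshold(u, lam):
    """Shrink each membrane potential toward zero by lam; values within +/-lam become 0."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

# Toy membrane potentials: only entries with |u| > lam stay active
u = np.array([-0.30, 0.05, 0.20])
a = softThreshold(u, 0.1)
```

With this form, `sparsityTradeoff` acts as both the activation threshold and the amount subtracted from every surviving coefficient.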

In [25]:
# Initialize some empty arrays to hold network summary statistics
percNonzero = np.zeros(numTrials)
energy = np.zeros(numTrials)
reconQuality = np.zeros(numTrials)

# Initialize Phi weight matrix with random values
phi = hf.l2Norm(np.random.randn(numInputs, numOutputs))

# Do sparse coding with LCA
prevFig = pf.plotDataTiled(phi, "Dictionary at time step 0", None)
for trial in range(numTrials):
    # Make batch of random image patches
    patchBatch = np.zeros((numInputs, batchSize))
    for batchNum in range(batchSize):
        patchBatch[:, batchNum] = hf.getRandomPatch(dataset, int(np.sqrt(numInputs)))

    # Compute sparse code for batch of image patches
    sparseCode = lcaSparsify(patchBatch, phi, tau, sparsityTradeoff, numSteps)
    
    # Update weights using inferred sparse code
    learningRate = eta / batchSize
    (phi, reconError) = lcaLearn(phi, patchBatch, sparseCode, learningRate)
    
    # Renormalize phi matrix
    phi = hf.l2Norm(phi)
    
    # Record some stats for plotting
    (percNonzero[trial], energy[trial], reconQuality[trial]) = hf.computePlotStats(sparseCode, reconError, sparsityTradeoff)
    
    # Update dictionary plot
    if trial % displayInterval == 0:
        prevFig = pf.plotDataTiled(phi, "Dictionary at time step "+str(trial), prevFig)
    
# Plot learned dictionary
prevFig = pf.plotDataTiled(phi, "Dictionary at time step "+str(trial), prevFig)


In [26]:
# Plot learning summary statistics
dataList = [energy, percNonzero, reconQuality]
labelList = ["Energy", "% Non-Zero Activations", "Recon Quality pSNR dB"]
title = "Summary Statistics for LCA Sparse Coding"
pf.makeSubplots(dataList, labelList, title)
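
The statistics plotted above come from `hf.computePlotStats`, which is not shown here. As a rough guide to what the "Energy" and "% Non-Zero Activations" curves could represent, the sketch below assumes the usual sparse coding energy (half the squared reconstruction error plus $\lambda$ times the L1 norm of the code), averaged over the batch. The function name and the exact normalization are assumptions, and the pSNR term is left out.

```python
import numpy as np

def sparseCodingStats(sparseCode, reconError, lam):
    """Hypothetical stand-in for hf.computePlotStats (without the pSNR term)."""
    batchSize = sparseCode.shape[1]
    reconTerm = 0.5 * np.sum(reconError ** 2) / batchSize       # mean reconstruction cost
    sparsityTerm = lam * np.sum(np.abs(sparseCode)) / batchSize  # mean L1 sparsity cost
    percNonzero = 100.0 * np.count_nonzero(sparseCode) / sparseCode.size
    return percNonzero, reconTerm + sparsityTerm

# Toy batch of 2 patches, 2 neurons
code = np.array([[0.0, 2.0], [1.0, 0.0]])
err = np.array([[1.0, 0.0], [0.0, 1.0]])
perc, energy = sparseCodingStats(code, err, 0.1)
```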


In [28]:
# Reconstruct an image
image = dataset[:,8]
imgPixels = image.size
numPatches = int(imgPixels/numInputs) # must divide evenly
patchBatch = image.reshape(numPatches, numInputs).T
sparseCode = lcaSparsify(patchBatch, phi, tau, sparsityTradeoff, numSteps)
reconBatch = phi @ sparseCode
reconImage = reconBatch.T.reshape(imgPixels)
imgAndRecon = np.vstack((image, reconImage)).T
pf.plotDataTiled(imgAndRecon, "Image and Corresponding Reconstruction");
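
Note that the `reshape` in the cell above cuts the flattened image into consecutive runs of `numInputs` entries rather than square tiles. If square, non-overlapping tiles are wanted instead, one possible approach is sketched below. The helper names are hypothetical, and the sketch assumes the image has been reshaped to 2-D in row-major order with side lengths divisible by the patch edge.

```python
import numpy as np

def imageToPatches(image2d, patchEdge):
    """Split a 2-D image into non-overlapping square tiles, one flattened tile per column."""
    rows, cols = image2d.shape
    assert rows % patchEdge == 0 and cols % patchEdge == 0, "image must tile evenly"
    blocks = image2d.reshape(rows // patchEdge, patchEdge, cols // patchEdge, patchEdge)
    blocks = blocks.transpose(0, 2, 1, 3)  # axes: (blockRow, blockCol, y, x)
    return blocks.reshape(-1, patchEdge * patchEdge).T  # (numInputs, numPatches)

def patchesToImage(patches, imageShape, patchEdge):
    """Inverse of imageToPatches: reassemble tiles into a 2-D image."""
    rows, cols = imageShape
    blocks = patches.T.reshape(rows // patchEdge, cols // patchEdge, patchEdge, patchEdge)
    return blocks.transpose(0, 2, 1, 3).reshape(rows, cols)

# Toy 16x24 image split into six 8x8 tiles and reassembled
toyImage = np.arange(16 * 24, dtype=float).reshape(16, 24)
toyPatches = imageToPatches(toyImage, 8)
toyRoundTrip = patchesToImage(toyPatches, toyImage.shape, 8)
```

The output of `imageToPatches` has the same `(numInputs, numPatches)` layout that `lcaSparsify` expects for `data`, so it could be swapped in for `patchBatch` when reconstructing.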


<b>YOUR ANSWER HERE:</b> What is the effect of changing the sparsity tradeoff parameter, $\lambda$?

<font color="red">Solution: </font>

$\lambda$ adjusts the threshold on the neurons, which has a direct impact on the sparsity of the image codes. Fewer active neurons means it is more difficult to produce a good reconstruction. Therefore, as you increase $\lambda$ you should see the overall reconstruction quality go down. In addition to affecting the image code, when $\lambda$ is turned up high, the basis functions do not learn at a uniform rate. Which basis functions start learning initially is largely based on the random initial conditions.
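
The relationship between $\lambda$ and sparsity can be checked on a toy problem. The sketch below is self-contained and does not use the course data or helpers: it assumes a soft threshold in place of `hf.lcaThreshold`, a random unit-norm dictionary, and random unit-norm inputs. All names (`toyLca`, `fractionActive`) are hypothetical.

```python
import numpy as np

def softThreshold(u, lam):
    # Assumed threshold function; the notebook's hf.lcaThreshold may differ
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def toyLca(data, phi, tau, lam, numSteps):
    # Same Euler-integrated dynamics as lcaSparsify, with the assumed threshold
    b = phi.T @ data
    gramian = phi.T @ phi - np.identity(phi.shape[1])
    u = np.zeros_like(b)
    for _ in range(numSteps):
        u += (1.0 / tau) * (b - gramian @ softThreshold(u, lam) - u)
    return softThreshold(u, lam)

rng = np.random.default_rng(0)
phi = rng.standard_normal((8, 16))
phi /= np.linalg.norm(phi, axis=0)    # unit-norm dictionary columns
data = rng.standard_normal((8, 100))
data /= np.linalg.norm(data, axis=0)  # unit-norm inputs, so |phi.T @ data| <= 1

def fractionActive(lam):
    """Fraction of nonzero coefficients in the inferred codes for a given lambda."""
    code = toyLca(data, phi, tau=10.0, lam=lam, numSteps=200)
    return np.count_nonzero(code) / code.size

for lam in (0.01, 0.1, 0.5, 1.5):
    print(f"lambda={lam:4.2f}  fraction active={fractionActive(lam):.3f}")
```

Because the inputs and dictionary columns have unit norm here, the driving input can never exceed 1 in magnitude, so any $\lambda$ above 1 silences every neuron. That extreme case illustrates why a large threshold hurts reconstruction.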

The inferred image code can be improved by increasing the value of $\tau$ and increasing the number of inference steps that LCA runs for. Increasing `numSteps` gives the LCA model more time to settle to a more accurate solution, and increasing $\tau$ results in a closer approximation to the desired differential equation solution. Note that a larger $\tau$ also means each step advances the dynamics by less, so `numSteps` generally needs to grow along with it.
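
To get a feel for how $\tau$ and `numSteps` interact, consider a single neuron with no competition, so the update reduces to `u += (1/tau) * (b - u)` starting from `u = 0`. After $n$ steps this gives $u_n = b\,(1 - (1 - 1/\tau)^n)$. The sketch below evaluates that expression; the function name is hypothetical and the no-competition simplification is an assumption made only for illustration.

```python
def fractionOfDriveReached(tau, numSteps):
    """Fraction of the driving input b that u reaches after numSteps Euler steps
    of du = b - u (no competition), starting from u = 0."""
    return 1.0 - (1.0 - 1.0 / tau) ** numSteps

default = fractionOfDriveReached(tau=50, numSteps=20)   # about 0.33
longer = fractionOfDriveReached(tau=50, numSteps=200)   # about 0.98
```

Under this simplification, the notebook's default settings (`tau = 50`, `numSteps = 20`) let the membrane potential reach only about a third of its driving input, so with `sparsityTradeoff = 0.1` a neuron would need a driving input of roughly 0.3 or more to become active at all.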

Computing a mean weight update from a batch of images results in a more confident estimate of the desired gradient. As a result, as we increase the `batchSize` parameter, we can also take larger gradient steps. However, this is only true up to a certain limit that is defined by the true gradient that we are approximating. Additionally, it is sometimes beneficial to introduce noise into learning in the form of smaller batch sizes.
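
The "mean weight update" is what dividing `eta` by `batchSize` achieves in the training loop: the matrix product in `lcaLearn` sums the per-patch updates, and the scaled learning rate turns that sum into an average. The sketch below checks this equivalence on random toy arrays (the sizes are arbitrary and chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
numInputs, numOutputs, batchSize, eta = 4, 6, 5, 0.08
reconError = rng.standard_normal((numInputs, batchSize))
sparseCode = rng.standard_normal((numOutputs, batchSize))

# Batched form used in lcaLearn: one matrix product sums over the batch
batchedUpdate = (eta / batchSize) * (reconError @ sparseCode.T)

# Equivalent per-patch form: average the outer products, then scale by eta
perPatch = [np.outer(reconError[:, k], sparseCode[:, k]) for k in range(batchSize)]
meanUpdate = eta * np.mean(perPatch, axis=0)
```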

Now try increasing the size of the network to 128 (or more) output neurons and decreasing the size to 32 (or less) output neurons. How do the learned features change as you modify the degree of overcompleteness?

<b>YOUR ANSWER HERE:</b> What is the effect of modifying the number of output neurons?

<font color="red">Solution: </font>

Adjusting the number of output neurons affects the degree of completeness between the basis set and the input pixel space. More neurons make it easier to satisfy both the reconstruction and sparsity constraints simultaneously.

In the complete case (`numInputs == numOutputs`), there is exactly one activity vector that perfectly reconstructs a given input, for any dictionary that spans the input space. However, it probably will not be sparse. The sparsity constraint acts as an information bottleneck, reducing reconstruction quality but making the code less redundant.

When the model is undercomplete (`numInputs > numOutputs`), we no longer have a guarantee that there exists an activity vector that perfectly reconstructs the input. Now, we have two information bottlenecks: one from the decrease in dimension and one from the sparsity constraint. The undercomplete dictionary is typically less diverse and the features start to resemble the principal components of the system. Typically, the result is worse reconstruction quality and low sparsity.

In the overcomplete regime (`numInputs < numOutputs`), there are infinitely many activity vectors that can perfectly reconstruct a given input, many of which are reasonably sparse. Sparse codes with overcomplete dictionaries therefore achieve lower reconstruction error with greater sparsity. The dictionaries learned are often much more diverse. On natural images, non-Gabor type functions will become more prominent.
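
The existence and uniqueness claims for the three regimes can be verified with plain linear algebra on random toy dictionaries. The sketch below ignores sparsity entirely and only asks whether an exact reconstruction exists; the sizes (8 "pixels" with 4, 8, or 16 dictionary elements) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)  # one toy "patch" with 8 pixels

# Complete case: square, full-rank dictionary -> exactly one exact code
phiComplete = rng.standard_normal((8, 8))
aComplete = np.linalg.solve(phiComplete, x)

# Overcomplete case: 16 elements for 8 pixels -> an 8-dimensional null space
phiOver = rng.standard_normal((8, 16))
aMinNorm = np.linalg.pinv(phiOver) @ x   # minimum L2-norm exact code
_, _, vt = np.linalg.svd(phiOver)        # rows 8..15 of vt span the null space
aOther = aMinNorm + 3.0 * vt[-1]         # a different code, same reconstruction

# Undercomplete case: 4 elements cannot span 8 pixels; least squares leaves a residual
phiUnder = rng.standard_normal((8, 4))
aUnder, *_ = np.linalg.lstsq(phiUnder, x, rcond=None)
residualNorm = np.linalg.norm(x - phiUnder @ aUnder)
```

In the overcomplete case, adding any null-space vector to a valid code gives another valid code. Sparse coding uses this freedom to pick, among the many exact or near-exact codes, one with few active coefficients.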

If the primary visual cortex really is doing something like sparse coding, we can ask what the completeness of the dictionary used might be. Estimates range from 100x to 10,000x overcomplete! If you'd like to know more, dictionary overcompleteness is explored in more detail in Professor Olshausen's 2013 SPIE paper titled "Highly Overcomplete Sparse Coding". [<a href="https://redwood.berkeley.edu/bruno/public/SPIE-overcomplete2013.pdf">Link</a>]