Literature sum/svcca (#183)
* Added svcca summary to literature review

* Changing restructured text compiling errors

* Formatting restructured text

* Added notes on experiment result and included bibliography

* Adding bibtex entries to literature_summary

* Formatting literature references and adding Philipps PDF document to repo
farkock committed Oct 31, 2018
1 parent b86059d commit de73281
Showing 3 changed files with 195 additions and 135 deletions.
199 changes: 116 additions & 83 deletions docs/literature_summary.rst
Literature summary
==================

1. The information bottleneck method (Tishby 1999)
--------------------------------------------------
Tishby, N., Pereira, F. C., & Bialek, W. (2000). The information bottleneck method. arXiv preprint physics/0004057.
:cite:`Tishby2000`

1.1. Glossary
^^^^^^^^^^^^^

1.2. Structure
^^^^^^^^^^^^^^

1.3. Criticism
^^^^^^^^^^^^^^^

1.4. Todo List
^^^^^^^^^^^^^^



2. Deep learning and the information bottleneck principle (Tishby 2015)
-----------------------------------------------------------------------
Tishby, N., & Zaslavsky, N. (2015, April). Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE (pp. 1-5). IEEE.
:cite:`Tishby2015`

2.1. Glossary
^^^^^^^^^^^^^

2.2. Structure
^^^^^^^^^^^^^^

2.3. Criticism
^^^^^^^^^^^^^^^

2.4. Todo List
^^^^^^^^^^^^^^



3. Opening the black box of deep neural networks via information (Tishby 2017)
------------------------------------------------------------------------------
Shwartz-Ziv, R., & Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
:cite:`Schwartz-ziv2017`

3.1. Glossary
^^^^^^^^^^^^^

3.2. Structure
^^^^^^^^^^^^^^

3.3. Criticism
^^^^^^^^^^^^^^^

3.4. Todo List
^^^^^^^^^^^^^^


4. On the information bottleneck theory of deep learning (Saxe 2018)
--------------------------------------------------------------------
Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., & Cox, D. D. (2018, May). On the information bottleneck theory of deep learning. In International Conference on Learning Representations.
:cite:`Saxe2018`

4.1 Key points of the paper
^^^^^^^^^^^^^^^^^^^^^^^^^^^

* none of the following claims of Tishby (:cite:`Tishby2015`) holds in the general case:

#. deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase
#. the compression phase is causally related to the excellent generalization performance of deep networks
#. the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent

|
* the observed compression depends on the activation function: double-sided saturating nonlinearities like tanh
yield a compression phase, but linear activation functions and single-sided saturating nonlinearities like ReLU do not.

|
* there is no evident causal connection between compression and generalization.

|
* the compression phase, when it exists, does not arise from stochasticity in training.

|
* when an input domain consists of a subset of task-relevant and task-irrelevant information, the task-irrelevant information is compressed
although the overall information about the input may monotonically increase with training time. This compression happens concurrently
with the fitting process rather than during a subsequent compression period.

|
4.2 Most important experiments
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#. Tishby's experiment reconstructed:


* fully connected network with layer widths 12-10-7-5-4-3-2
* trained with stochastic gradient descent to produce a binary classification from a 12-dimensional input
* 256 randomly selected samples per batch
* mutual information is calculated by binning the output activations into 30 equal intervals between -1 and 1 (see the estimator sketch after this list)
* trained on Tishby's dataset
* tanh-activation function

#. Tishby's experiment reconstructed with ReLU activation:

* fully connected network with layer widths 12-10-7-5-4-3-2
* trained with stochastic gradient descent to produce a binary classification from a 12-dimensional input
* 256 randomly selected samples per batch
* mutual information is calculated by binning the output activations into 30 equal intervals between -1 and 1
* ReLU-activation function

#. Tanh-activation function on MNIST:

* fully connected network with layer widths 784-1024-20-20-20-10
* trained with stochastic gradient descent to classify the ten MNIST digit classes from 784-dimensional inputs
* non-parametric kernel density mutual information estimator
* trained on MNIST dataset
* tanh-activation function

#. ReLU-activation function on MNIST:

* fully connected network with layer widths 784-1024-20-20-20-10
* trained with stochastic gradient descent to classify the ten MNIST digit classes from 784-dimensional inputs
* non-parametric kernel density mutual information estimator
* trained on MNIST dataset
* ReLU-activation function
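
The binning estimator used in the first two reconstructions is simple enough to state in a few lines. The following is a minimal NumPy sketch (our illustration, not the authors' code); the function name and the assumption that the activations already lie in [-1, 1] are ours:

.. code-block:: python

    import numpy as np

    def binned_mi(activations, labels, n_bins=30):
        """Estimate I(X;T) and I(T;Y) of one layer by binning.

        activations : (n_samples, n_units) layer outputs, assumed in [-1, 1]
        labels      : (n_samples,) class labels
        """
        bins = np.linspace(-1, 1, n_bins + 1)
        # Discretise every unit; each row of bin indices is one symbol of T.
        digitized = np.digitize(activations, bins)
        _, t = np.unique(digitized, axis=0, return_inverse=True)

        def entropy(symbols):
            _, counts = np.unique(symbols, return_counts=True)
            p = counts / counts.sum()
            return -np.sum(p * np.log2(p))

        # T is a deterministic function of X, so I(X;T) = H(T).
        i_xt = entropy(t)
        # I(T;Y) = H(T) - H(T|Y), with H(T|Y) averaged over the classes.
        h_t_given_y = sum(
            (labels == y).mean() * entropy(t[labels == y])
            for y in np.unique(labels)
        )
        return i_xt, i_xt - h_t_given_y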

4.3 Presentation
^^^^^^^^^^^^^^^^

`Click here to open presentation as PDF document. <_static/on_the_information_bottleneck_theory_presentation.pdf>`_


5. SVCCA: singular vector canonical correlation analysis
--------------------------------------------------------
:cite:`Raghu2017`

5.1 Key points of the paper
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- They developed a method that analyses each neuron's activation vector (i.e.
the scalar outputs that are emitted on input data points). This analysis gives
insight into the learning dynamics and the learned representations.

- SVCCA is a general method that compares two learned representations of
different neural network layers and architectures. It is either possible to
compare the same layer at different time steps, or simply different layers.

- The comparison of two representations fulfills two important properties:

* It is invariant to affine transformation (which allows the comparison
between different layers and networks).

* It is fast to compute, which allows more comparisons to be calculated
than with previous methods.

5.2 Experiment set-up
^^^^^^^^^^^^^^^^^^^^^

- **Dataset**: mostly CIFAR-10 (augmented with random translations)

- **Architecture**: One convolutional network and one residual network

- For some of the figures they designed a toy regression task: training a fully connected network with four hidden layers, 1D input, and 4D output.


5.3 How SVCCA works
^^^^^^^^^^^^^^^^^^^

- SVCCA is short for Singular Vector Canonical Correlation Analysis and
therefore combines the Singular Value Decomposition with a Canonical Correlation
Analysis.

- The representation of a neuron is defined as a table/function that maps the
inputs on all possible outputs for a single neuron. Its representation is
therefore studied as a set of responses over a finite set of inputs. Formally,
that means that given a dataset :math:`X = \{x_1, \dots, x_m\}` and a neuron :math:`i`
on layer :math:`l`, we define :math:`z^{l}_{i}` to be the vector of outputs on
:math:`X`, i.e.

.. math::

    z^{l}_{i} = (z^{l}_{i}(x_1), \dots, z^{l}_{i}(x_m)).

Note that :math:`z^{l}_{i}` is a single neuron's response over the entire
dataset and not an entire layer's response for a single input. In this sense
the neuron can be thought of as a single vector in a high-dimensional space.
A layer is therefore a subspace of :math:`\mathbb{R}^m` spanned by its neurons'
vectors.
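
In code, this view of a neuron as a vector over the dataset is easy to materialise. A small illustrative helper (a sketch under our naming, not part of the paper) could look like this:

.. code-block:: python

    import numpy as np

    def layer_responses(layer_fn, X):
        """Build the matrix whose row i is the response vector z_i^l.

        layer_fn : maps one input x to the layer's activation vector
        X        : the dataset x_1, ..., x_m
        """
        # Column j is the layer's activation on x_j, so row i is neuron i's
        # response over the whole dataset -- the vector z_i^l defined above.
        return np.stack([np.asarray(layer_fn(x)) for x in X], axis=1)

The rows of this matrix span the layer's subspace, and it is exactly the form of input the SVCCA procedure below expects.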

1. **Input**: takes two (not necessarily different) sets of neurons (typically layers of a network)

.. math::

    l_1 = \{z^{l_1}_{1}, \dots, z^{l_1}_{m_1}\} \text{ and } l_2 = \{z^{l_2}_{1}, \dots, z^{l_2}_{m_2}\}

2. **Step 1**: Use the SVD of each subspace to get sub-subspaces :math:`l_1' \subset l_1` and :math:`l_2' \subset l_2`, which consist of the most important directions of the original subspaces :math:`l_1, l_2`.

3. **Step 2**: Compute Canonical Correlation similarity of :math:`l_1', l_2'`: linearly transform :math:`l_1', l_2'` to be as aligned as possible and compute correlation coefficients.

4. **Output**: pairs of aligned directions :math:`(\widetilde{z}_{i}^{l_1}, \widetilde{z}_{i}^{l_2})` and how well they correlate, :math:`\rho_i`. The SVCCA similarity is defined as

.. math::

    \bar{\rho} = \frac{1}{\min(m_1,m_2)} \sum_i \rho_i .

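Putting the two steps together, a minimal NumPy sketch of the whole procedure (our illustration, not the authors' released implementation; the variance-retention threshold ``keep`` is our assumption) might be:

.. code-block:: python

    import numpy as np

    def svcca_similarity(l1, l2, keep=0.99):
        """SVCCA similarity of two layers.

        l1, l2 : arrays of shape (n_neurons, n_datapoints); row i is the
                 response vector z_i of neuron i over the dataset.
        keep   : fraction of variance retained by the SVD step.
        """
        def top_directions(layer):
            # Step 1: SVD; keep the singular directions that explain
            # a `keep` fraction of the variance (the sub-subspace l').
            layer = layer - layer.mean(axis=1, keepdims=True)
            _, s, vt = np.linalg.svd(layer, full_matrices=False)
            k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep)) + 1
            return vt[:k]  # rows are orthonormal directions over the dataset

        a, b = top_directions(l1), top_directions(l2)
        # Step 2: CCA. Because the rows of a and b are orthonormal, the
        # singular values of a @ b.T are the canonical correlations rho_i.
        rho = np.linalg.svd(a @ b.T, compute_uv=False)
        return rho.mean()  # the SVCCA similarity
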
5.4 Results
^^^^^^^^^^^

- The dimensionality of a layer's learned representation does not have to equal the number of neurons in the layer.

- Because the learning dynamics converge bottom-up, they suggest a computationally more efficient method for training the network, *Freeze Training*: layers are sequentially frozen after a certain number of time steps (a sketch follows after this list).

- A computational speed-up is achieved with a Discrete Fourier Transform (for convolutional layers), which causes the matrices involved to become block-diagonal.

- Moreover, SVCCA captures the semantics of different classes, with similar classes having similar sensitivities, and vice versa.
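
Freeze Training is easy to sketch. Below is a minimal PyTorch-style illustration (the helper name and the sequential-model assumption are ours, not the paper's code):

.. code-block:: python

    import torch.nn as nn

    def freeze_first(model: nn.Sequential, n_frozen: int) -> None:
        """Freeze the first n_frozen sub-modules of a sequential model.

        Called on a schedule with a growing n_frozen, this freezes layers
        bottom-up, as Freeze Training suggests.
        """
        for i, layer in enumerate(model):
            trainable = i >= n_frozen
            for p in layer.parameters():
                p.requires_grad = trainable

After each call the optimizer should be rebuilt over the parameters that are still trainable.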


.. bibliography:: references.bib
