Kraskov estimator (#199)
farkock committed Feb 28, 2019
1 parent 02b97f0 commit 6e39bcc
Showing 2 changed files with 139 additions and 6 deletions.
123 changes: 118 additions & 5 deletions docs/2.literature_summary.rst
@@ -86,12 +86,125 @@ Literature summary
`Click here to open presentation as PDF document. <_static/on_the_information_bottleneck_theory_presentation.pdf>`_


5. Estimating mutual information
--------------------------------
:cite:`Kraskov2004`


5.1 Introduction
^^^^^^^^^^^^^^^^
- Kraskov et al. propose an alternative mutual information estimator that is
based not on binning but on k-nearest-neighbour distances.

- Mutual information is often used as a measure of independence between random
variables. We note that mutual information is zero if and only if the two random
variables are strictly independent (the "if" direction is checked in a short
derivation at the end of this subsection).

- Mutual information has some well-known properties and advantages owing to its
close ties to Shannon entropy (see the appendix of the paper), yet estimating
mutual information from data is not always easy.

- Most mutual information estimation techniques are based on binning, which often
leads to a systematic error.

- Consider a set of :math:`N` bivariate measurements, :math:`z_i = (x_i, y_i),
i = 1,...,N`, which are assumed to be iid (independent identically distributed)
realizations of a random variable :math:`Z=(X,Y)` with density :math:`\mu (x,y)`.
:math:`x` and :math:`y` can be scalars or elements of a higher dimensional space.

- For simplicity we adopt the convention :math:`0 \cdot \log(0) = 0`, so that we can
consider probability density functions that are not strictly positive.

- The marginal densities of :math:`X` and :math:`Y` can be denoted as follows:

.. math::
\mu_x(x) = \int \mu (x,y) dy \ \text{and } \ \mu_y(y) = \int \mu (x,y) dx.
- Therefore we can define mutual information as

.. math::
I(X,Y) = \int_Y \int_X \mu (x,y) \cdot \log \dfrac{\mu (x,y)}{\mu_x (x) \mu_y(y)} dx dy.
- Note that the base of the logarithm sets the unit in which information is measured:
to measure in bits, one takes base 2. In the following we take the natural
logarithm, i.e. mutual information is measured in nats.

- Our aim is to estimate mutual information without any knowledge of the probability
density functions :math:`\mu`, :math:`\mu_x` and :math:`\mu_y`. The only
information we have is the set :math:`\{ z_i \}`.
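
- As a quick check of the independence property stated above (the "if" direction):
  if :math:`X` and :math:`Y` are independent, then
  :math:`\mu (x,y) = \mu_x (x) \mu_y (y)`, so the logarithm in the definition of
  :math:`I(X,Y)` vanishes and

  .. math::
     I(X,Y) = \int_Y \int_X \mu_x (x) \mu_y (y) \cdot \log \dfrac{\mu_x (x) \mu_y (y)}{\mu_x (x) \mu_y (y)} dx dy = \int_Y \int_X \mu_x (x) \mu_y (y) \cdot \log 1 \ dx dy = 0.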


5.2 Binning
^^^^^^^^^^^
- Binning is a frequently used technique for estimating mutual information: the
supports of :math:`X` and :math:`Y` are partitioned into bins of finite size,
and the integral is approximated by the finite sum:

.. math::
I(X,Y) \approx I_{\text{binned}} (X,Y) \equiv \sum_{i,j} p(i,j) \log \dfrac{p(i,j)}{p_x(i)p_y(j)},
where :math:`p_x(i) = \int_i \mu_x (x) dx`, :math:`p_y(j) = \int_j \mu_y(y) dy` and
:math:`p(i,j) = \int_i \int_j \mu (x,y) dx dy` (here :math:`\int_i` denotes the
integral over bin :math:`i`).

- Let :math:`n_x(i)` be the number of points falling into bin :math:`i` of
:math:`X` and, analogously, :math:`n_y(j)` the number of points falling into
bin :math:`j` of :math:`Y`. Moreover, :math:`n(i,j)` is the number of points in
their intersection.

- Since we do not know the exact probability density functions, we approximate them
by :math:`p_x(i) \approx \frac{n_x(i)}{N}`, :math:`p_y(j) \approx \frac{n_y(j)}{N}`,
and :math:`p(i,j) \approx \frac{n(i,j)}{N}`.

- For :math:`N \rightarrow \infty` and bin sizes tending to zero, the binning
approximation (:math:`I_{\text{binned}}`) indeed converges to :math:`I(X,Y)`,
provided that all densities exist as proper functions.

- Note that the bins do not all have to be the same size; adaptive bin sizes
actually lead to much better estimates. (A small sketch of the simpler
fixed-width variant is given below.)
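
To make the binned estimate concrete, here is a minimal NumPy sketch for scalar
samples; the function name, the equal-width bins and the bin count are
illustrative choices, not anything prescribed by the paper:

.. code-block:: python

   import numpy as np

   def binned_mi(x, y, bins=16):
       """Binned estimate I_binned(X, Y) for scalar samples x_i, y_i (in nats)."""
       n_ij, _, _ = np.histogram2d(x, y, bins=bins)   # n(i, j)
       p_ij = n_ij / n_ij.sum()                       # p(i, j) ~ n(i, j) / N
       p_i = p_ij.sum(axis=1, keepdims=True)          # p_x(i)
       p_j = p_ij.sum(axis=0, keepdims=True)          # p_y(j)
       mask = p_ij > 0                                # convention 0 * log(0) = 0
       return np.sum(p_ij[mask] * np.log(p_ij[mask] / (p_i @ p_j)[mask]))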

5.3 Kraskov estimator
^^^^^^^^^^^^^^^^^^^^^
- The Kraskov estimator uses k-nearest neighbour statistics to estimate mutual
information.

- The basic idea is to estimate :math:`H(X)` from the average distance to the
:math:`k`-th nearest neighbour, averaged over all :math:`x_i`.

- Since mutual information between two random variables can also be written as

.. math::
I(X,Y) = H(X) + H(Y) - H(X,Y),
with :math:`H(X)= - \int \mu (x) \log \mu (x) dx` being the Shannon entropy, we
could estimate the mutual information by estimating the three entropies
:math:`H(X)`, :math:`H(Y)` and :math:`H(X,Y)` separately. However, the errors
made in the individual estimates would presumably not cancel. Therefore, we
proceed a bit differently:

- Assume some metrics to be given on the spaces spanned by :math:`X`, :math:`Y`
and :math:`Z=(X,Y)`.

- For each point :math:`z_i=(x_i,y_i)` we rank its neighbours by distance
:math:`d_{i,j} = ||z_i - z_j||: d_{i,j_1} \leq d_{i,j_2} \leq d_{i,j_3} \leq ...`.
Similar rankings can be done in the subspaces :math:`X` and :math:`Y`.

- Furthermore, we will use the maximum norm for the distances in the space
:math:`Z=(X,Y)`, i.e.

.. math::
||z-z'||_{\max} = \max \{ ||x - x'||, ||y - y'||\},
while any norms can be used for :math:`||x - x'||` and :math:`||y - y'||`.

- We introduce some further notation: :math:`\frac{\epsilon (i)}{2}` is the
distance between :math:`z_i` and its :math:`k`-th nearest neighbour. (A sketch
of the resulting estimator is given below.)
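
For orientation, the paper combines these ingredients into its first estimator,
:math:`I^{(1)}(X,Y) = \psi(k) + \psi(N) - \langle \psi(n_x+1) + \psi(n_y+1) \rangle`,
where :math:`\psi` is the digamma function and :math:`n_x(i)` (resp. :math:`n_y(i)`)
counts the points whose distance to :math:`x_i` (resp. :math:`y_i`) is strictly
smaller than :math:`\epsilon(i)/2`. Below is a minimal SciPy-based sketch of this
formula; the function name and the tolerance used to enforce the strict inequality
are implementation choices:

.. code-block:: python

   import numpy as np
   from scipy.spatial import cKDTree
   from scipy.special import digamma

   def ksg_mi(x, y, k=3):
       """Sketch of Kraskov's first estimator for samples x (N, d_x), y (N, d_y)."""
       n = len(x)
       z = np.hstack([x, y])

       # eps(i)/2: max-norm distance from z_i to its k-th neighbour
       # (k + 1 because the query returns the point itself at distance 0)
       eps_half = cKDTree(z).query(z, k=k + 1, p=np.inf)[0][:, -1]

       # n_x(i), n_y(i): points strictly closer than eps(i)/2 in each subspace
       tree_x, tree_y = cKDTree(x), cKDTree(y)
       nx = np.array([len(tree_x.query_ball_point(x[i], eps_half[i] * (1 - 1e-10), p=np.inf)) - 1
                      for i in range(n)])
       ny = np.array([len(tree_y.query_ball_point(y[i], eps_half[i] * (1 - 1e-10), p=np.inf)) - 1
                      for i in range(n)])

       return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

As a sanity check, for two jointly Gaussian variables with correlation coefficient
:math:`r` the estimate should approach the analytic value
:math:`-\frac{1}{2}\log(1-r^2)` (in nats) for large :math:`N`.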


6. SVCCA: singular vector canonical correlation analysis
--------------------------------------------------------
:cite:`Raghu2017`

6.1 Key points of the paper
^^^^^^^^^^^^^^^^^^^^^^^^^^^

- They developed a method that analyses each neuron's activation vector (i.e.
@@ -110,7 +223,7 @@ Literature summary
* It is fast to compute, which allows more comparisons to be calculated
than with previous methods.

6.2 Experiment set-up
^^^^^^^^^^^^^^^^^^^^^

- **Dataset**: mostly CIFAR-10 (augmented with random translations)
@@ -120,7 +233,7 @@ Literature summary
- In order to produce a few figures, they designed a toy regression task
(training a fully connected network with four hidden layers, 1D input and 4D
output).


6.3 How SVCCA works
^^^^^^^^^^^^^^^^^^^

- SVCCA is short for Singular Vector Canonical Correlation Analysis and
@@ -159,7 +272,7 @@ Literature summary
.. math::
\bar{\rho} = \frac{1}{\min(m_1,m_2)} \sum_i \rho_i .
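
As an illustration of how :math:`\bar{\rho}` can be computed, below is a minimal
NumPy sketch of the SVCCA pipeline (SVD-reduce each layer's activations, then
average the canonical correlations between the reduced subspaces). The function
name, the 99% variance threshold and the QR-based computation of the
:math:`\rho_i` are implementation choices, not code from the paper:

.. code-block:: python

   import numpy as np

   def svcca_similarity(acts1, acts2, keep_var=0.99):
       """acts1, acts2: activation matrices of shape (neurons, datapoints)."""
       def svd_reduce(acts):
           acts = acts - acts.mean(axis=1, keepdims=True)      # center each neuron
           u, s, vt = np.linalg.svd(acts, full_matrices=False)
           k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep_var)) + 1
           return s[:k, None] * vt[:k]                         # top singular directions

       a1, a2 = svd_reduce(acts1), svd_reduce(acts2)
       # canonical correlations rho_i = singular values of Q1^T Q2, where Q1, Q2
       # are orthonormal bases of the reduced subspaces over the datapoints
       q1, _ = np.linalg.qr(a1.T)
       q2, _ = np.linalg.qr(a2.T)
       rho = np.linalg.svd(q1.T @ q2, compute_uv=False)
       return rho.mean()                                       # mean correlation
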
6.4 Results
^^^^^^^^^^^

- The dimensionality of a layer's learned representation does not have to be the same as the number of neurons in the layer.
22 changes: 21 additions & 1 deletion docs/references.bib
@@ -1,13 +1,33 @@
%% This BibTeX bibliography file was created using BibDesk.
%% http://bibdesk.sourceforge.net/
%% Created for Farina Kock at 2018-12-12 11:20:06 +0100
%% Saved with string encoding Unicode (UTF-8)
@article{Kraskov2004,
Adsnote = {Provided by the SAO/NASA Astrophysics Data System},
Adsurl = {https://ui.adsabs.harvard.edu/\#abs/2004PhRvE..69f6138K},
Archiveprefix = {arXiv},
Author = {{Kraskov}, Alexander and {St{\"o}gbauer}, Harald and {Grassberger}, Peter},
Date-Added = {2018-12-12 11:19:19 +0100},
Date-Modified = {2018-12-12 11:20:05 +0100},
Doi = {10.1103/PhysRevE.69.066138},
Eid = {066138},
Eprint = {cond-mat/0305641},
Journal = {Physical Review E},
Keywords = {05.90.+m, 02.50.-r, 87.10.+e, Other topics in statistical physics thermodynamics and nonlinear dynamical systems, Probability theory stochastic processes and statistics, General theory and mathematical aspects, Condensed Matter - Statistical Mechanics, Condensed Matter - Disordered Systems and Neural Networks},
Month = Jun,
Pages = {066138},
Primaryclass = {cond-mat.stat-mech},
Title = {{Estimating mutual information}},
Volume = {69},
Year = 2004,
Bdsk-Url-1 = {https://doi.org/10.1103/PhysRevE.69.066138}}

@inproceedings{Saxe2018,
Abstract = {The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case. Through a combination of analytical results and simulation, we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent. Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.},
Author = {Andrew Michael Saxe and Yamini Bansal and Joel Dapello and Madhu Advani and Artemy Kolchinsky and Brendan Daniel Tracey and David Daniel Cox},
