# Visualization methods

## Deconvolution

M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” presented at the Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, 2014, vol. 8689, pp. 818–833.

Notes: a very popular and useful method for examining what a CNN learns.

I think the deconvolution method is better described in CS231n <http://cs231n.stanford.edu/slides/2016/winter1516_lecture9.pdf>. Essentially, it is ordinary backprop through every layer except ReLU, where the backward signal (the diff) is itself passed through a ReLU, so its negative values are zeroed out.

To see how deconv is actually implemented, check the Caffe fork used by the Deep Visualization Toolbox: <https://github.com/BVLC/caffe/compare/master...yosinski:deconv-deep-vis-toolbox>

There, LRN is ignored in the backward pass, and the backward pass of ReLU is another ReLU.
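
The difference between the two ReLU backward rules can be sketched in a few lines of NumPy. This is only an illustration of the rule described above, not code from the toolbox; the function names and the toy arrays are made up for the example.

```python
import numpy as np

def relu_backward_backprop(grad_out, forward_input):
    # Ordinary backprop: pass the diff only where the forward input was positive.
    return grad_out * (forward_input > 0)

def relu_backward_deconvnet(grad_out):
    # Deconvnet rule: ignore the forward input and apply ReLU to the diff itself.
    return np.maximum(grad_out, 0.0)

# Toy forward input to a ReLU and a toy diff arriving from the layer above.
forward_input = np.array([-1.0, 2.0, 3.0, -4.0])
grad_out = np.array([0.5, -0.7, 0.9, 1.1])

backprop_diff = relu_backward_backprop(grad_out, forward_input)
deconv_diff = relu_backward_deconvnet(grad_out)
print(backprop_diff)
print(deconv_diff)
```

Note that the deconvnet rule keeps the positive diff at positions where the forward input was negative, which ordinary backprop would zero out.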

In addition, the paper sets an example of using visualization to guide architecture design; see Section 4.1. The "aliasing artifacts caused by the large stride..." actually refers to the blocky patterns in the third panel of Figure 5 (it is (c) or (d), depending on whether you read the label in the figure or in the caption). In the figure there is only one such pattern (first row, second column), but the author's slides show more. The slides are at <http://videolectures.net/eccv2014_zeiler_convolutional_networks/>.

![comparison with alexnet](./_deconv/comparison_with_alexnet.png)

Fig. 5 in the paper is wrong. Check the arXiv version (v3; it is Fig. 6 there): <https://arxiv.org/pdf/1311.2901.pdf>.

![correct fig 5](./_deconv/arxiv_v3_fig6.png)

Some details.

At the end of Section 3, they say they normalize the magnitude of the filter weights. I don't think this is needed in general. Maybe it just matters for their models. Or maybe people simply ignore it, and doing this normalization gains some performance, as needed for the ImageNet challenge.





~~~
@inproceedings{Zeiler:2014fr,
author = {Zeiler, Matthew D and Fergus, Rob},
title = {{Visualizing and Understanding Convolutional Networks}},
booktitle = {Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I},
year = {2014},
editor = {Fleet, David J and Pajdla, Tom{\'a}s and Schiele, Bernt and Tuytelaars, Tinne},
pages = {818--833},
publisher = {Springer},
keywords = {classics, deep learning},
doi = {10.1007/978-3-319-10590-1_53},
language = {English},
read = {Yes},
rating = {5},
date-added = {2017-02-16T19:50:02GMT},
date-modified = {2017-03-27T00:55:38GMT},
abstract = {Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark Krizhevsky et al. [18]. However there is no clear understanding of why the},
url = {http://dx.doi.org/10.1007/978-3-319-10590-1_53},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Zeiler/ECCV%202014%20Part%20I%202014%20Zeiler.pdf},
file = {{ECCV 2014 Part I 2014 Zeiler.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2014/Zeiler/ECCV 2014 Part I 2014 Zeiler.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1007/978-3-319-10590-1_53}}
}
~~~

## Taylor expansion around optimal stimulus

P. Berkes and L. Wiskott, “On the Analysis and Interpretation of Inhomogeneous Quadratic Forms as Receptive Fields,” Neural Computation, vol. 18, no. 8, pp. 1868–1895, Aug. 2006.

Notes: I only read it up to Section 5 (included), as I care more about visualization than about the statistical test.

Essential contributions.

1. General methods to solve Eq. (4.1). That is, maximizing/minimizing an inhomogeneous quadratic form (which can model complex V1 cells) under a norm constraint.

2. Eq. (5.5). Given the optimal stimulus, find the change of firing rate along all directions orthogonal to the direction of the optimal stimulus (Eq. (5.10)). For a demo of this, see
    * [Visualization of optimal stimuli and invariances for Tiled Convolutional Neural Networks](http://ai.stanford.edu/~quocle/TCNNweb/)
    * [original code](http://people.brandeis.edu/~berkes/software/qforms-tk/index.html)
    * [Yimeng's implementation](https://github.com/leelabcnbc/tang-paper-2017/blob/master/neuron_fitting/debug/debug_hessian_visualization_complex_cell.ipynb)
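
For the first contribution, here is a minimal sketch of one standard way to maximize a quadratic form on a sphere. It assumes the form is written as $g(x) = \frac{1}{2} x^T H x + f^T x + c$ (the constant is dropped), uses the Lagrange condition $Hx + f = \lambda x$ with $\lambda$ above the largest eigenvalue of $H$, and finds $\lambda$ by bisection. It is not necessarily the algorithm in the paper, it ignores the degenerate case where $f$ has no component along the top eigenvector, and all names are made up for the example.

```python
import numpy as np

def maximize_on_sphere(H, f, r, iters=200):
    """Maximize 0.5 x^T H x + f^T x subject to ||x|| = r (generic case only)."""
    mu, V = np.linalg.eigh(H)   # H = V diag(mu) V^T, eigenvalues ascending
    beta = V.T @ f              # f expressed in the eigenbasis of H

    def norm_at(lam):
        # Norm of x(lam) = (lam I - H)^{-1} f, decreasing in lam for lam > mu_max.
        return np.linalg.norm(beta / (lam - mu))

    lo = mu[-1]                                  # norm_at grows without bound here
    hi = mu[-1] + np.linalg.norm(f) / r + 1.0    # here norm_at(hi) < r
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if norm_at(mid) > r:
            lo = mid
        else:
            hi = mid
    return V @ (beta / (hi - mu))

rng = np.random.default_rng(0)
N = 6
A = rng.standard_normal((N, N))
H = (A + A.T) / 2               # random symmetric quadratic term
f = rng.standard_normal(N)      # random linear term
r = 1.0

def g(x):
    return 0.5 * x @ H @ x + f @ x

x_plus = maximize_on_sphere(H, f, r)
print(np.linalg.norm(x_plus), g(x_plus))
```

Minimization works the same way after flipping the signs of `H` and `f`.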

There are some errors, and some points worth noting, in the paper's description of the method.

1. Above Eq. (5.8), we actually need to compute the null space of $x^+$, that is, $N-1$ vectors orthogonal to $x^+$. Check Quoc Le's and my implementations for how to do this. Doing Gram-Schmidt on $x^+$ plus $e_1$ through $e_{N-1}$ doesn't necessarily give the correct result (for example, these $N$ vectors are linearly dependent whenever the last component of $x^+$ is zero).
2. Since $x^+$ is a local maximum, you would expect Eq. (5.5) to give negative values along all directions. This is true, but the eigenvalues of Eq. (5.8) are not all negative. Instead, you need to add back the offset term in Eq. (5.5); then all the offset eigenvalues are negative. The least negative ones correspond to the most invariant directions, and vice versa. Check the original implementation and Yimeng's implementation (`best_variance_and_invariance_directions`).
3. Eq. (5.8) simply computes the eigenvalues of $H$ along directions orthogonal to $x^+$. That is, they parameterize $w$ as $Bx$, with $B$ an $N \times (N-1)$ basis matrix and $x$ an $(N-1)$-vector. This justifies using Eq. (5.9) to map back to the original space.
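
Points 1 and 3 can be sketched with NumPy. This is a minimal illustration, not the code from either implementation linked above: it gets an orthonormal basis $B$ of the complement of $x^+$ from a complete QR decomposition (one of several ways to do it), diagonalizes $B^T H B$, and maps the eigenvectors back with $B$. The offset term from point 2 is deliberately left out, and `H` and `x_plus` are random stand-ins.

```python
import numpy as np

def orthogonal_complement_basis(x_plus):
    """Return an N x (N-1) matrix whose orthonormal columns are all orthogonal to x_plus."""
    x = np.asarray(x_plus, dtype=float).reshape(-1, 1)
    q, _ = np.linalg.qr(x, mode="complete")  # first column of q is parallel to x_plus
    return q[:, 1:]

def restricted_eigen(H, x_plus):
    """Eigen-decomposition of H restricted to directions orthogonal to x_plus."""
    B = orthogonal_complement_basis(x_plus)
    H_restricted = B.T @ H @ B               # (N-1) x (N-1)
    evals, evecs = np.linalg.eigh(H_restricted)
    directions = B @ evecs                   # map back to the original N-dim space
    return evals, directions

rng = np.random.default_rng(0)
N = 5
A = rng.standard_normal((N, N))
H = (A + A.T) / 2                            # stand-in for the Hessian
x_plus = rng.standard_normal(N)              # stand-in for the optimal stimulus

evals, directions = restricted_eigen(H, x_plus)
print(evals)
```

Each column of `directions` is a unit vector orthogonal to `x_plus`, and the curvature of the quadratic term along it equals the matching entry of `evals`.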


Some caveats.

For this to work, you need to 1) really find the optimal stimulus (or at least some local maximum), and 2) compute the Hessian correctly. Both are pretty difficult for a really complex neuron, say a CNN.

1) is either not very accurate, or it gives noise-like input (in my experience, the images that most excite an artificial neuron are mostly noise-like).
2) is either theoretically impossible (see <https://github.com/leelabcnbc/tang-paper-2017/blob/master/neuron_fitting/debug/debug_hessian_old_plus_adam_vs_lbfgs.ipynb>), or takes very long to compute (say, using TensorFlow).

Yimeng found that this method doesn't work well with really complex neurons. See <https://github.com/leelabcnbc/tang-paper-2017/blob/master/neuron_fitting/debug/cnn_fitting_debug_nonOT_visualize.ipynb>

In the paper, they are not analyzing fitted V1 cells. Instead, they are analyzing artificial units learned by Slow Feature Analysis (SFA); check Section 2. According to the Deep Learning book, SFA has a closed-form solution, so these units can't be too complicated.


Last paragraph of Section 4: the optimal $x$ might not make sense or be relevant.

> Note that although $x^+$ is the stimulus that elicits the strongest response in the function, it does not necessarily mean that it is representative of the class of stimuli that give the most important contribution to its output. This depends on the distribution of the input vectors.


~~~
@article{Berkes:2006el,
author = {Berkes, Pietro and Wiskott, Laurenz},
title = {{On the Analysis and Interpretation of Inhomogeneous Quadratic Forms as Receptive Fields}},
journal = {Neural Computation},
year = {2006},
volume = {18},
number = {8},
pages = {1868--1895},
month = aug,
publisher = {MIT Press 238 Main St., Suite 500, Cambridge, MA 02142-1046 USA journals-info@mit.edu},
keywords = {V1},
doi = {10.1162/neco.2006.18.8.1868},
language = {English},
read = {Yes},
rating = {4},
date-added = {2017-05-29T17:19:56GMT},
date-modified = {2017-06-12T15:26:03GMT},
abstract = {In this letter, we introduce some mathematical and numerical tools to analyze and interpret inhomogeneous quadratic forms. The resulting characterization is in some aspects similar to that given by experimental studies of cortical cells, making it particularly suitable for application to second-order approximations and theoretical models of physiological receptive fields. We first discuss two ways of analyzing a quadratic form by visualizing the coefficients of its quadratic and linear term directly and by considering the eigenvectors of its quadratic term. We then present an algorithm to compute the optimal excitatory and inhibitory stimuli{\textemdash}those that maximize and minimize the considered quadratic form, respectively, given a fixed energy constraint. The analysis of the optimal stimuli is completed by considering their invariances, which are the transformations to which the quadratic form is most insensitive, and by introducing a test to determine which of these are statistically significant. Next we prop...},
url = {http://www.mitpressjournals.org/doi/10.1162/neco.2006.18.8.1868},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2006/Berkes/Neural%20Computation%202006%20Berkes.pdf},
file = {{Neural Computation 2006 Berkes.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2006/Berkes/Neural Computation 2006 Berkes.pdf:application/pdf}},
uri = {\url{papers3://publication/doi/10.1162/neco.2006.18.8.1868}}
}
~~~

# Adversarial Examples

## First paper on adversarial examples

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv, vol. cs.CV, Dec. 2013.

Notes: Good analysis paper, showing 1) that at higher layers, individual units and mixtures of units are roughly the same. BUT I'm not sure this is still the case for networks with only one FC layer (only fc8, in AlexNet's terms). 2) Adversarial examples. Their root cause might be that each layer is not stable (Table 5).



Section 4

> a non-local generalization prior over the input space. In other words, it is assumed that is possible for the output unit to assign non-significant (and, presumably, non-epsilon) probabilities to regions of the input space that contain no training examples in their vicinity.

This is also mentioned in the DL book: DL models assume some non-local prior.


Section 4.1: here they assume image pixels have range 0 to 1, not 0-255.

Here D(x,f(x)) should be x, not f(x); clearly f(x) doesn't make the dimensions match.


How they actually find an adversarial example is really vague here. It's explained more clearly in [Exploring the Space of Adversarial Images](https://arxiv.org/pdf/1510.05328.pdf) (version 5 on arXiv), with code at <https://github.com/tabacof/adversarial>; this is referred to below as "the 2015 paper". I copied the relevant page into the notes; check [here](./_adversarial/1510.05328_algo.pdf).

Essentially, you first select a wrong label l to fool the network. Then you start from a small C (this paper) or a big C (the 2015 paper), for which you will surely find a somewhat big r that satisfies f(x+r)=l. Then you decrease C little by little, until f(x+r) != l. This process will intuitively decrease r, while biasing r toward perturbations that preserve f(x+r)=l. It makes sense intuitively; I'm not sure how good it is in theory.


> This penalty function method would yield the exact solution for D(X,l) in the case of convex losses, however neural networks are non-convex in general, so we end up with an approximation in this case.

I don't think this is right. At least I can't derive it using my RUBBISH note.



======RUBBISH START======

I don't understand the math of box-constrained L-BFGS. I think it's using the correspondence between the constrained and Lagrange forms here. See the text around Eq. 5.7 and 5.8 of [Statistical Learning with Sparsity](https://trevorhastie.github.io/): "For convex programs, the Lagrangian allows for the constrained problem (5.5) to be solved by reduction to an equivalent unconstrained problem.", or pages 16-17 of <http://www.stat.cmu.edu/~ryantibs/convexopt-S15/lectures/12-kkt.pdf>

Here I would say the correspondence is loose. To make it more precise, we should modify the constraint f(x+r)=l in the first (original) problem to be something like loss(x+r,l)<0.001.

The two problems, written in constrained form and in Lagrange form, are

1) constrained form:

$$\min_r |r| \quad \text{s.t.} \quad \mathrm{loss}(x+r, l) < \epsilon$$

2) Lagrange form:

$$\min_r |r| + \lambda \, \mathrm{loss}(x+r, l)$$

First, we replace the hard constraint f(x+r)=l with a softer constraint using the loss (I would say the paper is poorly written; no relationship between f and loss_f is mentioned).

Then they also assume that a big eps corresponds to a small lambda, and a small eps corresponds to a big lambda. This is also more or less assumed in Section 7.2 of the Deep Learning book. Check my notes on that.
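
The Lagrange form and the sweep over its weight can be sketched on a toy problem. Everything here is an assumption made for illustration: a hypothetical two-class linear softmax classifier stands in for the network, the squared norm of r replaces |r| so the objective is smooth, and projected gradient descent replaces box-constrained L-BFGS. The sweep starts from a big lambda (big r, f(x+r)=l) and decreases it until the label is lost, keeping the last r that still fools the classifier.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def solve_for_lambda(W, b, x, target, lam, r0, steps=1000):
    """Projected gradient descent on ||r||^2 + lam * CE(x + r, target), with x + r in [0, 1]."""
    r = r0.copy()
    onehot = np.eye(W.shape[0])[target]
    # Step size from a Lipschitz bound on the gradient of the objective.
    step = 1.0 / (2.0 + 0.5 * lam * np.linalg.norm(W, 2) ** 2)
    for _ in range(steps):
        p = softmax(W @ (x + r) + b)
        grad = 2.0 * r + lam * (W.T @ (p - onehot))
        r = np.clip(x + r - step * grad, 0.0, 1.0) - x  # keep x + r inside the box
    return r

def find_adversarial(W, b, x, target, lambdas):
    """Sweep lambda from big to small; return the last r with argmax f(x + r) == target."""
    r, best = np.zeros_like(x), None
    for lam in lambdas:
        r = solve_for_lambda(W, b, x, target, lam, r)  # warm start from the previous r
        if np.argmax(W @ (x + r) + b) != target:
            break
        best = r.copy()
    return best

# Toy classifier: class 0 if x[0] > x[1], class 1 otherwise.
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
b = np.zeros(2)
x = np.array([0.6, 0.4])   # classified as 0
target = 1                 # the wrong label l

r_adv = find_adversarial(W, b, x, target, np.geomspace(100.0, 0.01, 20))
print(r_adv, np.linalg.norm(r_adv))
```

In this toy case the decision boundary is at distance 0.2/sqrt(2) from x, so the returned r can be compared against the true minimum; a finer grid of lambda values gets closer to it.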

======RUBBISH END======




Section 4.2

> A subtle, but essential detail is that we only got improvements by generating adversarial examples for each layer outputs which were used to train all the layers above. The network was trained in an alternating fashion, maintaining and updating a pool of adversarial examples for each layer separately in addition to the original training set.

For adversarial examples to be useful in training, you can't use input-level adversarial examples. Instead, you need adversarial examples for the intermediate layers. But this would really complicate training.

Section 4.3

I'm not sure about the math for deriving the operator norm of W when W is a convolution. But whatever... The conclusion is what's important.
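
As a sanity check on the idea, here is a minimal sketch of the simplest special case only: a single-channel, stride-1 convolution with circular boundary conditions. In that case the operator is diagonalized by the 2D DFT, so its operator norm is the largest magnitude of the DFT of the zero-padded kernel. The paper's setting (multiple channels, stride) is more general and is not covered here; the function names and the random kernel are made up for the example.

```python
import numpy as np

def circular_conv2d(image, kernel):
    """Single-channel circular convolution with stride 1."""
    out = np.zeros_like(image, dtype=float)
    for a in range(kernel.shape[0]):
        for b in range(kernel.shape[1]):
            # np.roll shifts the image so that out[i, j] picks up image[i - a, j - b].
            out += kernel[a, b] * np.roll(image, shift=(a, b), axis=(0, 1))
    return out

def operator_norm_fft(kernel, n):
    """Largest singular value of the convolution, read off the DFT of the padded kernel."""
    return np.abs(np.fft.fft2(kernel, s=(n, n))).max()

def operator_norm_dense(kernel, n):
    """Same quantity from the explicit (n*n) x (n*n) matrix; only feasible for tiny n."""
    columns = [circular_conv2d(e.reshape(n, n), kernel).ravel() for e in np.eye(n * n)]
    return np.linalg.norm(np.stack(columns, axis=1), 2)

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))
n = 6

norm_fft = operator_norm_fft(kernel, n)
norm_dense = operator_norm_dense(kernel, n)
print(norm_fft, norm_dense)
```

The two numbers should agree up to floating-point error, which is the point of the check: the cheap FFT computation matches the brute-force spectral norm.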

* p.1: First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. -- Highlighted Mar 29, 2017
* p.1: In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input. -- Highlighted Mar 29, 2017
* p.1: Instead, we show in section 3 that random projections of φ(x) are semantically indistinguishable from the coordinates of φ(x). This puts into question the conjecture that neural networks disentangle variation factors across coordinates. Generally, it seems that it is the entire space of activations, rather than the individual units, that contains the bulk of the semantic information. -- Highlighted Mar 29, 2017
* p.2: That is, if we use one neural net to generate a set of adversarial examples, we find that these examples are still statistically hard for another neural network even when it was trained with different hyperparameters or, most surprisingly, when it was trained on a different set of examples. -- Highlighted Mar 29, 2017
* p.3: Our experiments show that any random direction v ∈ Rn gives rise to similarly interpretable semantic properties. More formally, we find that images x′ are semantically related to each other, for many x′ such that x′ = argmax ⟨φ(x), v⟩ -- Highlighted Mar 29, 2017
* p.3: This suggests that the natural basis is not better than a random basis for inspecting the properties of φ(x). This puts into question the notion that neural networks disentangle variation factors across coordinates. -- Highlighted Mar 29, 2017
* p.3: Although such analysis gives insight on the capacity of φ to generate invariance on a particular subset of the input distribution, it does not explain the behavior on the rest of its domain. We shall see in the next section that φ has counterintuitive properties in the neighbourhood of almost every point form data distribution. -- Highlighted Mar 29, 2017
* p.4: a non-local generalization prior over the input space. In other words, it is assumed that is possible for the output unit to assign nonsignificant (and, presumably, non-epsilon) probabilities to regions of the input space that contain no training examples in their vicinity. -- Highlighted Mar 29, 2017
* p.4: It is implicit in such arguments that local generalization—in the very proximity of the training examples—works as expected. -- Highlighted Mar 29, 2017
* p.5: These deformations are, however, statistically inefficient, for a given example: they are highly correlated and are drawn from the same distribution throughout the entire training of the model. We propose a scheme to make this process adaptive in a way that exploits the model and its deficiencies in modeling the local space around the training data. -- Highlighted Mar 29, 2017
* p.5: Cross model generalization: a relatively large fraction of examples will be misclassified by networks trained from scratch with different hyper-parameters (number of layers, regularization or initial weights) -- Highlighted Mar 29, 2017
* p.6: A subtle, but essential detail is that we only got improvements by generating adversarial examples for each layer outputs which were used to train all the layers above. The network was trained in an alternating fashion, maintaining and updating a pool of adversarial examples for each layer separately in addition to the original training set. -- Highlighted Mar 29, 2017
* p.8: The intriguing conclusion is that the adversarial examples remain hard for models trained even on a disjoint training set, although their effectiveness decreases considerably. -- Highlighted Mar 29, 2017
* p.9: It results that a conservative measure of the unstability of the network can be obtained by simply computing the operator norm of each fully connected and convolutional layer. -- Highlighted Mar 29, 2017
* p.9: but they don’t attempt to explain why these examples generalize across different hyperparameters or training sets. -- Highlighted Mar 29, 2017
* p.9: This suggests a simple regularization of the parameters, consisting in penalizing each upper Lipschitz bound, which might help improve the generalisation error of the networks. -- Highlighted Mar 29, 2017
* p.10: Indeed, if the network can generalize well, how can it be confused by these adversarial negatives, which are indistinguishable from the regular examples? Possible explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near every virtually every test case. However, we don’t have a deep understanding of how often adversarial negatives appears, and thus this issue should be addressed in a future research. -- Highlighted Mar 29, 2017
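
The unit analysis in the p.3 highlights (x′ = argmax ⟨φ(x), v⟩) can be sketched as follows. This is only an illustration: `features` is a random stand-in for φ(x) over a held-out image set, and the function name and sizes are made up. With real activations, one would display the images at the returned indices for a natural basis direction and for a random direction, and compare them by eye.

```python
import numpy as np

def top_images(features, direction, k=8):
    """Indices of the k images whose features have the largest inner product with direction."""
    scores = features @ direction
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
n_images, n_units = 1000, 64
features = rng.standard_normal((n_images, n_units))  # stand-in for phi(x), one row per image

unit_direction = np.eye(n_units)[3]                  # natural basis: a single unit
random_direction = rng.standard_normal(n_units)
random_direction /= np.linalg.norm(random_direction) # a random direction of unit length

idx_unit = top_images(features, unit_direction)
idx_random = top_images(features, random_direction)
print(idx_unit, idx_random)
```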

~~~
@article{Szegedy:2013vw,
author = {Szegedy, C and Zaremba, W and Sutskever, I and Bruna, J and Erhan, D and Goodfellow, I and Fergus, Rob},
title = {{Intriguing properties of neural networks}},
journal = {ArXiv e-prints},
year = {2013},
volume = {cs.CV},
month = dec,
keywords = {deep learning, To Read},
read = {Yes},
rating = {4},
date-added = {2017-03-29T19:52:30GMT},
date-modified = {2017-03-30T14:48:37GMT},
url = {http://arxiv.org/abs/1312.6199},
local-url = {file://localhost/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Szegedy/arXiv%202013%20Szegedy.pdf},
file = {{arXiv 2013 Szegedy.pdf:/Users/yimengzh/Documents/Papers3_revised/Library.papers3/Articles/2013/Szegedy/arXiv 2013 Szegedy.pdf:application/pdf}},
uri = {\url{papers3://publication/uuid/C6374003-66D2-4059-941E-D3B1E4F50BBB}}
}
~~~