# Paper | Dynamic Routing Between Capsules_Notes

> Paper link: [[1710.09829] Dynamic Routing Between Capsules](https://arxiv.org/abs/1710.09829) | Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton | Google Brain, Toronto | {sasabour, frosst, geoffhinton}@google.com

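This post opens my paper-reading series. It builds on the work of those who studied the paper before me and is meant to strengthen my ability to read papers, implement core algorithms, and write project-level engineering code. If you spot mistakes, corrections are welcome. (ZJ)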


## Dynamic Routing Between Capsules

### Abstract

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.

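In this paper a capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity, which may be a whole object or a part of one. The length (norm) of the activity vector gives the probability that the entity exists, and its orientation encodes the instantiation parameters. Active capsules at one level use transformation matrices to predict the instantiation parameters of higher-level capsules. When several predictions agree, a higher-level capsule becomes active. The paper shows that a discriminatively trained multi-layer capsule system reaches state-of-the-art performance on MNIST and is considerably better than a convolutional network at recognizing highly overlapping digits. The network relies on an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to the higher-level capsules whose activity vectors have a large scalar product with its prediction.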

### 1 Introduction


Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Introspection is a poor guide to understanding how much of our knowledge of a scene comes from the sequence of fixations and how much we glean from a single fixation, but in this paper we will assume that a single fixation gives us much more than just a single identified object and its properties. We assume that our multi-layer visual system creates a parse tree-like structure on each fixation, and we ignore the issue of how these single-fixation parse trees are coordinated over multiple fixations.

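Human vision ignores irrelevant details by using a carefully chosen sequence of fixation points, so that only a tiny fraction of the optic array is ever processed at the highest resolution. Introspection is a poor guide to how much of our knowledge of a scene comes from the sequence of fixations and how much we gather from a single fixation. The paper assumes that a single fixation yields much more than one identified object and its properties: the multi-layer visual system is assumed to build a parse-tree-like structure on each fixation. How these single-fixation parse trees are coordinated across multiple fixations is left aside.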




Parse trees are generally constructed on the fly by dynamically allocating memory. Following Hinton et al. [2000], however, we shall assume that, for a single fixation, a parse tree is carved out of a fixed multilayer neural network like a sculpture is carved from a rock. Each layer will be divided into many small groups of neurons called “capsules” (Hinton et al. [2011]) and each node in the parse tree will correspond to an active capsule. Using an iterative routing process, each active capsule will choose a capsule in the layer above to be its parent in the tree. For the higher levels of a visual system, this iterative process will be solving the problem of assigning parts to wholes.

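Parse trees are usually built on the fly by dynamically allocating memory. Following Hinton et al. ("Learning to parse images", 2000), however, the paper assumes that for a single fixation a parse tree is carved out of a fixed multilayer neural network, the way a sculpture is carved from a rock (note from 雷锋网 AI 科技评论: that is, only some of the branches are kept). Each layer is divided into many small groups of neurons called "capsules" (Hinton et al., "Transforming auto-encoders", 2011), and each node in the parse tree corresponds to an active capsule. Through an iterative routing process, each active capsule chooses a capsule in the layer above as its parent in the tree. For the higher levels of a visual system, this iterative process solves the problem of assigning parts to wholes.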



The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. In this paper we explore an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity. We ensure that the length of the vector output of a capsule cannot exceed 1 by applying a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.


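The neuron activities inside an active capsule represent the various properties of a particular entity present in the image. These properties can include many kinds of instantiation parameters, such as pose (position, size, orientation), deformation, velocity, albedo, hue, and texture. One very special property is whether the instantiated entity exists in the image at all. An obvious way to represent existence is a separate logistic unit whose output is the probability that the entity exists (note from 雷锋网 AI 科技评论: the output lies between 0 and 1, where 0 means absent and 1 means present). The authors explore an interesting alternative: the length of the instantiation-parameter vector represents the probability that the entity exists, and the network is forced to use the orientation of the vector to represent the entity's properties. To keep the length of a capsule's output vector from exceeding 1, a non-linearity is applied that leaves the orientation of the vector unchanged and scales down its magnitude.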
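The passage above describes the length-limiting non-linearity only in words. Section 2 of the paper writes it as $v = \frac{\lVert s \rVert^2}{1 + \lVert s \rVert^2} \frac{s}{\lVert s \rVert}$, usually called "squashing". The following is a minimal pure-Python sketch of that formula; the helper names `squash` and `length` and the small `eps` guard against division by zero are assumptions made for illustration, not part of the paper.

```python
import math

def length(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def squash(s, eps=1e-9):
    """Rescale s so its length lies in [0, 1) while keeping its direction.

    eps is an illustrative guard so the zero vector does not divide by zero.
    """
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]

short_out = squash([0.1, 0.0])    # short input: length shrinks toward 0
long_out = squash([30.0, 40.0])   # long input: length approaches 1 from below

print(length(short_out), length(long_out))
```

Because every component is multiplied by the same positive factor, the ratio between components (the orientation) is preserved, while the length is always below 1 and can therefore be read as a probability.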



The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which increases the coupling coefficient for that parent and decreases it for other parents. This increases the contribution that the capsule makes to that parent thus further increasing the scalar product of the capsule’s prediction with the parent’s output. This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. We demonstrate that our dynamic routing mechanism is an effective way to implement the “explaining away” that is needed for segmenting highly overlapping objects.

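Because a capsule outputs a vector, a powerful dynamic routing mechanism can be used to make sure that the output is sent to an appropriate parent in the layer above. Initially the output is routed to all possible parents, scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a "prediction vector" by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, top-down feedback increases the coupling coefficient for that parent and decreases it for the other parents. This raises the capsule's contribution to that parent, which further increases the scalar product between the capsule's prediction and the parent's output. This kind of "routing-by-agreement" should be far more effective than the very primitive routing implemented by max-pooling, which lets neurons in one layer ignore everything except the most active feature detector in a local pool in the layer below. The authors show that dynamic routing is an effective way to implement the "explaining away" needed to segment highly overlapping objects.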
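The feedback loop described above can be sketched in a few lines. This is a simplified pure-Python illustration, not the authors' implementation: it takes the prediction vectors `u_hat[i][j]` (lower capsule `i`, parent `j`) as given instead of computing them from weight matrices, uses a softmax over routing logits so that each capsule's coupling coefficients sum to 1, and applies the squashing non-linearity from Section 2 of the paper. The function names, the toy data, and the default of three iterations are assumptions chosen for the example.

```python
import math

def squash(s, eps=1e-9):
    """Keep the direction of s, map its length into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]

def softmax(xs):
    """Numerically stable softmax; the result sums to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def length(v):
    return math.sqrt(dot(v, v))

def route(u_hat, iterations=3):
    """Routing-by-agreement over prediction vectors u_hat[i][j].

    Returns the coupling coefficients used in the final iteration and the
    parent output vectors computed from them.
    """
    n_lower, n_parent, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_parent for _ in range(n_lower)]  # equal logits at the start
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # couplings of each lower capsule sum to 1
        v = []
        for j in range(n_parent):
            # Parent input: coupling-weighted sum of the predictions made for it.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_lower):
            for j in range(n_parent):
                # A large scalar product (agreement) raises the routing logit.
                b[i][j] += dot(u_hat[i][j], v[j])
    return c, v

# Toy data: three lower capsules, two parents, 2-D predictions.
# Predictions for parent 0 roughly agree; predictions for parent 1 conflict.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.1], [-1.0, 0.0]],
    [[0.9, 0.0], [0.0, 1.0]],
]
c, v = route(u_hat)
print(c)
print([length(vj) for vj in v])
```

In this toy setup the three predictions for parent 0 point in nearly the same direction, so its output grows and every lower capsule shifts coupling toward it, while the conflicting predictions for parent 1 largely cancel and its output stays short.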



Convolutional neural networks (CNNs) use translated replicas of learned feature detectors. This allows them to translate knowledge about good weight values acquired at one position in an image to other positions. This has proven extremely helpful in image interpretation. Even though we are replacing the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement, we would still like to replicate learned knowledge across space. To achieve this, we make all but the last layer of capsules be convolutional. As with CNNs, we make higher-level capsules cover larger regions of the image. Unlike max-pooling however, we do not throw away information about the precise position of the entity within the region. For low level capsules, location information is “place-coded” by which capsule is active. As we ascend the hierarchy, more and more of the positional information is “rate-coded” in the real-valued components of the output vector of a capsule. This shift from place-coding to rate-coding combined with the fact that higher-level capsules represent more complex entities with more degrees of freedom suggests that the dimensionality of capsules should increase as we ascend the hierarchy.


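Convolutional neural networks (CNNs) use translated replicas of learned feature detectors, which lets them carry knowledge about good weight values acquired at one position in an image over to other positions. This has proven extremely helpful for image interpretation. Although the authors replace the scalar-output feature detectors of CNNs with vector-output capsules, and max-pooling with routing-by-agreement, they still want learned knowledge to be replicated across space, so every capsule layer in the model except the last is convolutional. As in CNNs, higher-level capsules cover larger regions of the image, but unlike max-pooling, capsules do not discard information about the precise position of the entity within the region. For low-level capsules, location information is "place-coded" by which capsule is active. Higher up the hierarchy, more and more of the positional information is "rate-coded" in the real-valued components of a capsule's output vector. This shift from place-coding to rate-coding, together with the fact that higher-level capsules represent more complex entities with more degrees of freedom, suggests that capsule dimensionality should increase as we ascend the hierarchy.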
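The contrast between max-pooling and a position-preserving output can be made concrete with a toy one-dimensional example. The sketch below is purely illustrative and is not the paper's mechanism: `max_pool_1d` is ordinary non-overlapping max-pooling, and `pool_with_offset` is a hypothetical helper that keeps the within-pool position as an extra real-valued component, loosely mirroring the idea of "rate-coding" position in an output vector.

```python
def max_pool_1d(values, size):
    """Non-overlapping max-pooling: keeps only the strongest activation per pool."""
    return [max(values[k:k + size]) for k in range(0, len(values), size)]

def pool_with_offset(values, size):
    """Hypothetical variant: also reports where in the pool the maximum sat."""
    pooled = []
    for k in range(0, len(values), size):
        window = values[k:k + size]
        strongest = max(window)
        pooled.append((strongest, window.index(strongest)))
    return pooled

left = [0.9, 0.0, 0.0, 0.0]    # feature detected at position 0
right = [0.0, 0.9, 0.0, 0.0]   # same feature detected at position 1

# Max-pooling cannot tell the two inputs apart: both give [0.9, 0.0].
print(max_pool_1d(left, 2), max_pool_1d(right, 2))

# Keeping the offset as an extra component preserves the distinction.
print(pool_with_offset(left, 2), pool_with_offset(right, 2))
```

The point of the example is only that a scalar maximum loses the exact position inside the pooled region, whereas an output with more than one component has room to carry it.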





---

References:

[1] AI研习社, Zhihu. [How should we view Hinton's paper "Dynamic Routing Between Capsules"?](https://www.zhihu.com/question/67287444/answer/252315722)
