Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks
----

https://arxiv.org/abs/1312.6082v4

![mdnr_fig1.png](mdnr_fig1.png)

Figure 1: a) An example input image to be transcribed. The correct output for this image is “700”. b) The graphical model structure of our sequence transcription model, depicted using plate notation (Buntine, 1994) to represent the multiple $S_i$. Note that the relationship between $X$ and $H$ is deterministic. The edges going from $L$ to $S_i$ are optional, but help draw attention to the fact that our definition of $P(S \mid X)$ does not query $S_i$ for $i > L$.

# 3 Problem description

Street number transcription is a special kind of sequence recognition. Given an image, the task is to identify the number in the image. See an example in Fig. 1a. The number to be identified is a sequence of digits, $s = s_1, s_2, \ldots, s_n$. When determining the accuracy of a digit transcriber, we compute the proportion of the input images for which the length $n$ of the sequence and every element $s_i$ of the sequence is predicted correctly. There is no “partial credit” for getting individual digits of the sequence correct. This is because for the purpose of making a map, a building can only be found on the map from its address if the whole street number was transcribed correctly.

For the purpose of building a map, it is extremely important to have at least human level accuracy. Users of maps find it very time consuming and frustrating to be led to the wrong location, so it is essential to minimize the amount of incorrect transcriptions entered into the map. It is, however, acceptable not to transcribe every input image. Because each street number may have been photographed many times, it is still quite likely that the proportion of buildings we can place on the map is greater than the proportion of images we can transcribe. 

We therefore advocate evaluating this task based on the coverage at certain levels of accuracy, rather than evaluating only the total degree of accuracy of the system. To evaluate coverage, the system must return a confidence value, such as the probability of the most likely prediction being correct. Transcriptions below some confidence threshold can then be discarded. The coverage is defined to be the proportion of inputs that are not discarded. The coverage at a certain specific accuracy level is the coverage that results when the confidence threshold is chosen to achieve that desired accuracy level. For map-making purposes, we are primarily interested in coverage at 98% accuracy or better, since this roughly corresponds to human accuracy.
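The metric described above can be sketched in a few lines of Python. This is only an illustrative reading of the definition, not the authors' evaluation code: the function name `coverage_at_accuracy` and its arguments are hypothetical, and ties in confidence are assumed to be kept or discarded together.

```python
def coverage_at_accuracy(confidences, correct, target_accuracy=0.98):
    """Return (coverage, threshold) for the largest kept set meeting the target.

    confidences: one confidence value per prediction (higher = more certain).
    correct: booleans, True where the whole sequence was transcribed correctly.
    Returns (0.0, None) if no threshold reaches the target accuracy.
    """
    # Rank predictions from most to least confident.
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    best_coverage, best_threshold = 0.0, None
    num_correct = 0
    for kept, (confidence, is_correct) in enumerate(ranked, start=1):
        num_correct += bool(is_correct)
        # Only cut between distinct confidence values, so a real threshold exists.
        is_last_of_tie = kept == len(ranked) or ranked[kept][0] < confidence
        if is_last_of_tie and num_correct / kept >= target_accuracy:
            best_coverage, best_threshold = kept / len(ranked), confidence
    return best_coverage, best_threshold
```

Predictions with confidence at or above the returned threshold are kept; everything else is discarded or routed to a more expensive process such as human review.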

Using confidence thresholding allows us to improve maps incrementally over time: if we develop a system with poor accuracy overall but good accuracy at some threshold, we can make a map with partial coverage, then improve the coverage when we get a more accurate transcription system in the future. We can also use confidence thresholding to do as much of the work as possible via the automated system and do the rest using more expensive means such as hiring human operators to transcribe the remaining difficult inputs.

One special property of the street number transcription problem is that the sequences are of bounded length. Very few street numbers contain more than five digits, so we can use models that assume the sequence length $n$ is at most some constant $N$, with $N = 5$ for this work. Systems that make such an assumption should be able to identify whenever this assumption is violated and refuse to return a transcription, so that the few street numbers of length greater than $N$ are not incorrectly added to the map after being transcribed as being length $N$. (Alternatively, one can return the most likely sequence of length $N$, and because the probability of that transcription being correct is low, the default confidence thresholding mechanism will usually reject such transcriptions without needing special code for handling the excess length case.)

# 4 Methods

Our basic approach is to train a probabilistic model of sequences given images. Let $S$ represent the output sequence and $X$ represent the input image. Our goal is then to learn a model of $P(S \mid X)$ by maximizing $\log P(S \mid X)$ on the training set.

To model $S$, we define $S$ as a collection of $N$ random variables $S_1, \ldots, S_N$ representing the elements of the sequence and an additional random variable $L$ representing the length of the sequence. We assume that the identities of the separate digits are independent from each other, so that the probability of a specific sequence $s = s_1, \ldots, s_n$ is given by

$$ P(S=s|X) = P(L=n|X) \prod_{i=1}^n P(S_i=s_i|X). $$
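As a rough illustration of this factorization, the sketch below evaluates the log of the product from per-classifier log-probabilities. The function name, argument layout, and toy numbers are assumptions made for this example; they do not come from the paper.

```python
import math

def sequence_log_prob(length_log_probs, digit_log_probs, digits):
    """log P(S = s | X) under the factorized model.

    length_log_probs[n] is log P(L = n | X).
    digit_log_probs[i][d] is log P(S_{i+1} = d | X).
    digits is the candidate sequence s_1, ..., s_n.
    """
    n = len(digits)
    total = length_log_probs[n]
    # Only the first n digit classifiers contribute; later positions are ignored.
    for position, digit in enumerate(digits):
        total += digit_log_probs[position][digit]
    return total

# Hypothetical toy model: N = 2 positions and only two symbols per position.
length_lp = [math.log(p) for p in (0.1, 0.2, 0.7)]
digit_lp = [[math.log(0.5), math.log(0.5)], [math.log(0.25), math.log(0.75)]]
log_p = sequence_log_prob(length_lp, digit_lp, [1, 1])  # log(0.7 * 0.5 * 0.75)
```

Working in log space turns the product into a sum, which avoids underflow when many small probabilities are combined.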

This model can be extended to detect when our assumption that the sequence has length at most N is violated. To allow for detecting this case, we simply add an additional value of L that represents this outcome.

Each of the variables above is discrete, and when applied to the street number transcription problem, each has a small number of possible values: L has only 7 values (0, . . . , 5, and “more than 5”), and each of the digit variables has 10 possible values. This means it is feasible to represent each of them with a softmax classifier that receives as input features extracted from X by a convolutional neural network. We can represent these features as a random variable H whose value is deterministic given X. In this model, P(S | X) = P(S | H). See Fig. 1b for a graphical model depiction of the network structure.

To train the model, one can maximize $\log P(S \mid X)$ on the training set using a generic method like stochastic gradient descent. Each of the softmax models (the model for $L$ and each $S_i$) can use exactly the same backprop learning rule as when training an isolated softmax layer, except that a digit classifier softmax model backprops nothing on examples for which that digit is not present.
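The masking rule can be made concrete with the standard softmax gradient. The following is a minimal sketch, assuming each classifier exposes raw logits and that we minimize the negative log-likelihood (equivalent to maximizing the log-likelihood); `logit_gradients` is a hypothetical helper, not code from the paper.

```python
import math

def softmax(logits):
    """Softmax with the usual max shift for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_gradients(length_logits, digit_logits, digits):
    """Gradient of -log P(S = s | X) with respect to every classifier's logits."""
    n = len(digits)
    # Length classifier: ordinary softmax cross-entropy gradient, p - one_hot(n).
    length_grad = softmax(length_logits)
    length_grad[n] -= 1.0
    digit_grads = []
    for i, logits in enumerate(digit_logits):
        if i < n:
            # Digit i is present: same rule as an isolated softmax layer.
            grad = softmax(logits)
            grad[digits[i]] -= 1.0
        else:
            # Digit i is absent from this example: nothing is backpropagated.
            grad = [0.0] * len(logits)
        digit_grads.append(grad)
    return length_grad, digit_grads
```

For a three-digit label such as 175, the fourth and fifth digit classifiers receive an all-zero gradient, so their weights are unaffected by that example.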

At test time, we predict

$$ s = (l, s_1, \ldots, s_l) = \arg\max_{L, S_1, \ldots, S_L} \log P(S \mid X). $$

This argmax can be computed in linear time. The argmax for each character can be computed independently. We then incrementally add up the log probabilities for each character. For each length $l$, the complete log probability is given by this running sum of character log probabilities, plus $\log P(l \mid x)$. The total runtime is thus $O(N)$.
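The procedure can be sketched as a single loop over candidate lengths. This is an illustrative implementation under assumed inputs (lists of log-probabilities); the function name and the toy distribution are hypothetical, and an extra “more than N” entry in the length distribution, if present, is simply ignored here.

```python
import math

def most_likely_sequence(length_log_probs, digit_log_probs):
    """Return (best_log_prob, digits) maximizing log P(L) plus the digit log-probs.

    length_log_probs[n] is log P(L = n | X) for n = 0..N.
    digit_log_probs[i][d] is log P(S_{i+1} = d | X).
    """
    max_len = len(digit_log_probs)
    # Independence lets us take the argmax at each position separately.
    best_digits = [max(range(len(row)), key=row.__getitem__) for row in digit_log_probs]
    best_score, best_len = length_log_probs[0], 0
    running = 0.0  # cumulative sum of the best digit log-probs so far
    for n in range(1, max_len + 1):
        running += digit_log_probs[n - 1][best_digits[n - 1]]
        score = length_log_probs[n] + running
        if score > best_score:
            best_score, best_len = score, n
    return best_score, best_digits[:best_len]

# Hypothetical toy distribution with N = 3 positions and three symbols.
length_lp = [math.log(p) for p in (0.05, 0.05, 0.8, 0.1)]
digit_lp = [
    [math.log(p) for p in (0.1, 0.8, 0.1)],
    [math.log(p) for p in (0.7, 0.2, 0.1)],
    [math.log(1 / 3)] * 3,
]
score, digits = most_likely_sequence(length_lp, digit_lp)
```

Only the running sum and the best score so far are stored, so the loop over lengths runs in $O(N)$ time without materializing a table.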

We preprocess by subtracting the mean of each image. We do not use any whitening (Hyvärinen et al., 2001), local contrast normalization (Sermanet et al., 2012), etc.

# 5 Experiments

In this section we present our experimental results. First, we describe our state of the art results on the public Street View House Numbers dataset in section 5.1. Next, we describe the performance of this system on our more challenging, larger but internal version of the dataset in section 5.2. We then present some experiments analyzing the performance of the system in section 5.4.

## 5.1 Public Street View House Numbers dataset

The Street View House Numbers (SVHN) dataset (Netzer et al., 2011) is a dataset of about 200k street numbers, along with bounding boxes for individual digits, giving about 600k digits total. To our knowledge, all previously published work cropped individual digits and tried to recognize those. We instead take original images containing multiple digits, and focus on recognizing them all simultaneously.

We preprocess the dataset in the following way – first we find the small rectangular bounding box that will contain individual character bounding boxes. We then expand this bounding box by 30% in both the x and the y direction, crop the image to that bounding box and resize the crop to 64 × 64 pixels. We then crop a 54 × 54 pixel image from a random location within the 64 × 64 pixel image. This means we generated several randomly shifted versions of each training example, in order to increase the size of the dataset. Without this data augmentation, we lose about half a percentage point of accuracy. Because of the differing number of characters in the image, this introduces considerable scale variability – for a single digit street number, the digit fills the whole box, meanwhile a 5 digit street number will have to be shrunk considerably in order to fit.
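The geometry of this preprocessing can be sketched without an image library. The helper names below are hypothetical, and two details are assumptions not stated in the text: the 30% expansion is applied to the total width and height around the box center, and clipping to the image border is left out.

```python
import random

def enclosing_box(digit_boxes):
    """Smallest (left, top, right, bottom) box containing every digit box."""
    lefts, tops, rights, bottoms = zip(*digit_boxes)
    return min(lefts), min(tops), max(rights), max(bottoms)

def expand_box(box, fraction=0.30):
    """Grow a box by `fraction` of its width and height, keeping its center."""
    left, top, right, bottom = box
    dx = (right - left) * fraction / 2
    dy = (bottom - top) * fraction / 2
    return left - dx, top - dy, right + dx, bottom + dy

def random_crop_origin(resized=64, crop=54, rng=random):
    """Top-left corner of a random crop-by-crop window in a resized image."""
    max_offset = resized - crop  # 10 pixels of slack for 64 -> 54
    return rng.randint(0, max_offset), rng.randint(0, max_offset)
```

The expanded box would then be cropped from the original image and resized to 64 × 64 before the random 54 × 54 window is taken; drawing a new origin for each presentation of an example yields the randomly shifted versions used for augmentation.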

Our best model obtained a sequence transcription accuracy of 96.03%. This is not accurate enough to use for adding street numbers to geographic location databases for placement on maps. However, using confidence thresholding we obtain 95.64% coverage at 98% accuracy. Since 98% accuracy is the performance of human operators, these transcriptions are acceptable to include in a map. We encourage researchers who work on this dataset in the future to publish coverage at 98% accuracy as well as the standard accuracy measure. Our system achieves a character-level accuracy of 97.84%. This is slightly better than the previous state of the art for a single network on the individual character task of 97.53% (Goodfellow et al., 2013).

Training this model took approximately six days using 10 replicas in DistBelief. The exact training time varies for each of the performance measures reported above: we picked the best stopping point for each performance measure separately, using a validation set.

Our best architecture consists of eight convolutional hidden layers, one locally connected hidden layer, and two densely connected hidden layers. All connections are feedforward and go from one layer to the next (no skip connections). The first hidden layer contains maxout units (Goodfellow et al., 2013) (with three filters per unit) while the others contain rectifier units (Jarrett et al., 2009; Glorot et al., 2011). The number of units at each spatial location in each layer is [48, 64, 128, 160] for the first four layers and 192 for all other locally connected layers. The fully connected layers contain 3,072 units each. Each convolutional layer includes max pooling and subtractive normalization. The max pooling window size is 2 × 2. The stride alternates between 2 and 1 at each layer, so that half of the layers don’t reduce the spatial size of the representation. All convolutions use zero padding on the input to preserve representation size. The subtractive normalization operates on 3x3 windows and preserves representation size. All convolution kernels were of size 5 × 5. We trained with dropout applied to all hidden layers but not the input.


## 5.2 Internal Street View data

Internally, we have a dataset with tens of millions of transcribed street numbers. However, on this dataset, there are no ground truth bounding boxes available. We use an automated method (beyond the scope of this paper) to estimate the centroid of each house number, then crop to a 128 × 128 pixel region surrounding the house number. We do not rescale the image because we do not know the extent of the house number. This means the network must be robust to a wider variation of scales than our public SVHN network. On this dataset, the network must also localize the house number, rather than merely localizing the digits within each house number. Also, because the training set is larger in this setting, we did not need to augment the data with random translations.

![mdnr_fig2.png](mdnr_fig2.png)

Figure 2: Difficult but correctly transcribed examples from the internal street numbers dataset. Some of the challenges in this dataset include diagonal or vertical layouts, incorrectly applied blurring from license plate detection pipelines, shadows and other occlusions.

This dataset is more difficult because it comes from more countries (more than 12), has street numbers with non-digit characters and the quality of the ground truth is lower. See Fig. 2 for some examples of difficult inputs from this dataset that our system was able to transcribe correctly, and Fig. 3 for some examples of difficult inputs that were considered errors.

![mdnr_fig3.png](mdnr_fig3.png)

Figure 3: Examples of incorrectly transcribed street numbers from the large internal dataset (transcription vs. ground truth). Note that for some of these, the “ground truth” is also incorrect. The ground truth labels in this dataset are quite noisy, as is common in real world settings. Some reasons for the ground truth errors in this dataset include: 1. The data was repurposed from an existing indexing pipeline where operators manually entered street numbers they saw. It was impractical to use the same size of images as the humans saw, so heuristics were used to create smaller crops. Sometimes the resulting crop omits some digits. 2. Some examples are fundamentally ambiguous, for instance street numbers that include non-digit characters, or images containing multiple street numbers, which humans transcribed as a single number with an arbitrary separator like “,” or “-”.

We obtained an overall sequence transcription accuracy of 91% on this more challenging dataset. Using confidence thresholding, we were able to obtain a coverage of 83% with 99% accuracy, or 89% coverage at 98% accuracy. On this task, due to the larger amount of training data, we did not see significant overfitting as we saw in SVHN, so we did not use dropout. Dropout tends to increase training time, and our largest models are already very costly to train. We also did not use maxout units. All hidden units were rectifiers (Jarrett et al., 2009; Glorot et al., 2011). Our best architecture for this dataset is similar to the best architecture for the public dataset, except we use only five convolutional layers rather than eight. (We have not tried using eight convolutional layers on this dataset; eight layers may obtain slightly better results, but the version of the network with five convolutional layers performed accurately enough to meet our business objectives.) The locally connected layers have 128 units per spatial location, while the fully connected layers have 4096 units per layer.

## 5.3 CAPTCHA puzzles dataset

CAPTCHAs are reverse Turing tests designed to use distorted text to distinguish humans from machines running automated text recognition software. reCAPTCHA is a leading CAPTCHA provider with an installed base of several hundreds of thousands of websites. To evaluate the generality of the proposed approach to recognizing arbitrary text, we created a dataset composed of the hardest CAPTCHA puzzles, examples of which are shown in Figure 4.

The model we use is similar to the best one used over the SVHN dataset with the following differences: we use 9 convolutional layers in this network instead of 11, with the first layer containing normal rectifier units instead of maxouts, the convolutional layers are also slightly bigger, while the fully connected ones smaller. The output of this model is case-sensitive and it can handle up to 8 character long sequences. The input is one of the two CAPTCHA words cropped to a size of 200x40 where random sub-crops of size 195x35 are taken. The performance reported was taken directly from a test set of 100K samples and a training set in the order of millions of CAPTCHA images.

![mdnr_fig4.png](mdnr_fig4.png)

Figure 4: Examples of images from the hard CAPTCHA puzzles dataset.

With this model, we are able to achieve a 99.8% accuracy on transcribing the hardest reCAPTCHA puzzle. It is important to note that these results do not indicate a reduction in the anti-abuse effectiveness of reCAPTCHA as a whole. reCAPTCHA is designed to be a risk analysis engine taking a variety of different cues from the user to make the final determination of human vs. bot. Today distorted text in reCAPTCHA serves increasingly as a medium to capture user engagements rather than a reverse Turing test in and of itself. These results do however indicate that the utility of distorted text as a reverse Turing test by itself is significantly diminished.

## 5.4 Performance analysis

In this section we explore the reasons for the unprecedented success of our neural network architecture for a complicated task involving localization and segmentation rather than just recognition. We hypothesize that for such a complicated task, depth is crucial to achieve an efficient representation of the task. State of the art recognition networks for images of cropped and centered digits or objects may have two to four convolutional layers followed by one or two densely connected hidden layers and the classification layers (Goodfellow et al., 2013). In this work we used several more convolutional layers. We hypothesize that the depth was crucial to our success. This is most likely because the earlier layers can solve the localization and segmentation tasks, and prepare a representation that has already been segmented so that later layers can focus on just recognition. Moreover, we hypothesize that such deep networks have very high representational capacity, and thus need a large amount of data to train successfully. Prior to our successful demonstration of this system, it would have been reasonable to expect that factors other than just depth would be necessary to achieve good performance on these tasks. For example, it could have been possible that a sufficiently deep network would be too difficult to optimize. In Fig. 5, we present the results of an experiment that confirms our hypothesis that depth is necessary for good performance on this task. For control experiments showing that large shallow models cannot achieve the same performance, see Fig. 6.

![mdnr_fig5.png](mdnr_fig5.png)

Figure 5: Performance analysis experiments on the public SVHN dataset show that fairly deep architectures are needed to obtain good performance on the sequence transcription task.

![mdnr_fig6.png](mdnr_fig6.png)

Figure 6: Performance analysis experiments on the public SVHN dataset show that increasing the number of parameters in smaller models does not allow such models to reach the same level of performance as deep models. This is primarily due to overfitting.

## 5.5 Application to Geocoding

The motivation for the development of this model was to decrease the cost of geocoding as well as scale it worldwide and keep up with change in the world. The model has now reached a high enough quality level that we can automate the extraction of street numbers on Street View images. Also, even if the model can be considered quite large, it is still efficient.

We can for example transcribe all the views we have of street numbers in France in less than an hour using our Google infrastructure. Most of the cost actually comes from the detection stage that locates the street numbers in the large Street View images. Worldwide, we automatically detected and transcribed close to 100 million physical street numbers at operator level accuracy. Having this new dataset significantly increased the geocoding quality of Google Maps in several countries especially the ones that did not already have other sources of good geocoding. In Fig. 7, you can see some automatically extracted street numbers from Street View imagery captured in South Africa.

# 6 Discussion

We believe with this model we have solved OCR for short sequences for many applications. On our particular task, we believe that now the biggest gain we could easily get is to increase the quality of the training set itself as well as increasing its size for general OCR transcription.

One caveat to our results with this architecture is that they rest heavily on the assumption that the sequence is of bounded length, with a reasonably small maximum length N. For unbounded N, our method is not directly applicable, and for large N our method is unlikely to scale well. Each separate digit classifier requires its own separate weight matrix. For long sequences this could incur too high of a memory cost. When using DistBelief, memory is not much of an issue (just use more machines) but statistical efficiency is likely to become problematic. Another problem with long sequences is the cost function itself. It’s also possible that, due to longer sequences having more digit probabilities multiplied together, a model of longer sequences could have trouble with systematic underestimation of the sequence length.

One possible solution could be to train a model that outputs one “word” (N character sequence) at a time and then slide it over the entire image followed by a simple decoding. Some early experiments in this direction have been promising.

Perhaps our most interesting finding is that neural networks can learn to perform complicated tasks such as simultaneous localization and segmentation of ordered sequences of objects. This approach of using a single neural network as an entire end-to-end system could be applicable to other problems, such as general text transcription or speech recognition.

# Appendix A: Example inference

In this appendix we provide a detailed example of how to run inference in a trained network to transcribe a house number. The purpose of this appendix is to remove any ambiguity from the more general description in the main text.

Transcription begins by computing the distribution over the sequence $S$ given an image $X$. See Fig. 8 for details of how this computation is performed.

To commit to a single specific sequence transcription, we need to compute $\arg\max_s P(S = s \mid H)$. It is easiest to do this in log scale, to avoid multiplying together many small numbers, since such multiplication can result in numerical underflow; i.e., in practice we actually compute $\arg\max_s \log P(S = s \mid H)$.

Note that $\log \mathrm{softmax}(z)$ can be computed efficiently and with numerical stability with the formula $\log \mathrm{softmax}(z)_i = z_i - \log \sum_j \exp(z_j)$. It is best to compute the log probabilities using this stable approach, rather than first computing the probabilities and then taking their logarithm. The latter approach is unstable; it can incorrectly yield $-\infty$ for small probabilities.
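A minimal sketch of a stable log-softmax follows. It additionally subtracts the largest logit before exponentiating, a common safeguard that is an assumption of this example rather than something stated in the text; the subtraction leaves the result unchanged but keeps `exp` from overflowing.

```python
import math

def log_softmax(logits):
    """Compute z_i - log(sum_j exp(z_j)) without overflow or underflow."""
    shift = max(logits)
    # log-sum-exp with the max factored out: every exponent is <= 0.
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_norm for z in logits]

# A logit gap of 800: the softmax probability of the second class underflows
# to 0.0 in double precision, but its log-probability is still finite here.
stable = log_softmax([0.0, -800.0])
```

Taking the logarithm of the underflowed probability instead would fail outright in Python (`math.log(0.0)` raises `ValueError`) or yield $-\infty$ in array libraries.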

Suppose that we have all of our output probabilities computed, and that they are the following (these are idealized example values, not actual values from the model):

![mdnr_tab1.png](mdnr_tab1.png)

![mdnr_tab2.png](mdnr_tab2.png)

Refer to the example input image in Fig. 8 to understand these probabilities. The correct length is 3. Our distribution over $L$ accurately reflects this, though we do think there is a reasonable possibility that $L$ is 4; maybe the edge of the door looks like a fourth digit. The correct transcription is 175, and we do assign these digits the highest probability, but also assign significant probability to the first digit being a 7, the second being a 9, or the third being a 6. There is no fourth digit, but if we parse the edge of the door as being a digit, there is some chance of it being a 1. Our distribution over the fifth digit is totally uniform since there is no fifth digit.

Our independence assumptions mean that when we compute the most likely sequence, the choice of which digit appears in each position doesn’t affect our choice of which digit appears in the other positions. We can thus pick the most likely digit in each position separately, leaving us with this table:

![mdnr_tab3.png](mdnr_tab3.png)

Finally, we can complete the maximization by explicitly calculating the probability of all seven possible sequence lengths:

![mdnr_tab4.png](mdnr_tab4.png)

Here the third column is just a cumulative sum over $\log P(S_L)$, so it can be computed in linear time. Likewise, the fourth column is computed by adding the third column to our existing $\log P(L)$ table. It is not even necessary to keep this final table in memory; we can use a for loop that generates it one element at a time and remembers the maximal element.

The correct transcription, 175, obtains the maximal log probability of −0.42144, and the model outputs this correct transcription.

![mdnr_fig8.png](mdnr_fig8.png)

Figure 8: Details of the computational graph we used to transcribe house numbers. In this diagram, we show how we compute the parameters of P (S | X), where X is the input image and S is the sequence of numbers depicted by the image. We first extract a set of features H from X using a convolutional network with a fully connected final layer. Note that only one such feature vector is extracted for the entire image. We do not use an HMM that models features explicitly extracted at separate locations. Because the final layer of the convolutional feature extractor is fully connected and has no weight sharing, we have not explicitly engineered any concept of spatial location into this representation. The network must learn its own means of representing spatial location in H. Six separate softmax classifiers are then connected to this feature vector H, i.e., each softmax classifier forms a response by making an affine transformation of H and normalizing this response with the softmax function. One of these classifiers provides the distribution over the sequence length P (L | H), while the others provide the distribution over each of the members of the sequence, P(S1 | H),...,P(S5 | H).
