# Chapter 12. Applications

* Psygrammer / QGM: Part 4 - Deep Learning [1]
* 김무성

# Contents
* 12.1 Large Scale Deep Learning
    - 12.1.1 Fast CPU Implementations
    - 12.1.2 GPU Implementations
    - 12.1.3 Large Scale Distributed Implementations
    - 12.1.4 Model Compression
    - 12.1.5 Dynamic Structure
    - 12.1.6 Specialized Hardware Implementations of Deep Networks
* 12.2 Computer Vision
    - 12.2.1 Preprocessing
        - 12.2.1.1 Contrast Normalization
        - 12.2.1.2 Dataset Augmentation
* 12.3 Speech Recognition
* 12.4 Natural Language Processing
    - 12.4.1 n-grams
    - 12.4.2 Neural Language Models
    - 12.4.3 High-Dimensional Outputs
        - 12.4.3.1 Use of a Short List
        - 12.4.3.2 Hierarchical Softmax
        - 12.4.3.3 Importance Sampling
        - 12.4.3.4 Noise-Contrastive Estimation and Ranking Loss
    - 12.4.4 Combining Neural Language Models with n-grams
    - 12.4.5 Neural Machine Translation
        - 12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data
    - 12.4.6 Historical Perspective
* 12.5 Other Applications
    - 12.5.1 Recommender Systems
        - 12.5.1.1 Exploration Versus Exploitation
    - 12.5.2 Knowledge Representation, Reasoning and Question Answering
        - 12.5.2.1 Knowledge, Relations and Question Answering

In this chapter, we describe 
* how to use deep learning to solve applications in 
    - computer vision, 
    - speech recognition, 
    - natural language processing, and 
    - other application areas of commercial interest. 
* We begin by discussing 
    - the large scale neural network implementations 
        - required for most serious AI applications. 
* Next, we review several specific application areas that 
    - deep learning has been used to solve. 
    - While one goal of deep learning is to design algorithms that are capable of solving a broad variety of tasks, so far some degree of specialization is needed.

# 12.1 Large Scale Deep Learning
* 12.1.1 Fast CPU Implementations
* 12.1.2 GPU Implementations
* 12.1.3 Large Scale Distributed Implementations
* 12.1.4 Model Compression
* 12.1.5 Dynamic Structure
* 12.1.6 Specialized Hardware Implementations of Deep Networks

Deep learning is based on the philosophy of connectionism: 
* while an individual biological neuron or 
* an individual feature in a machine learning model is
    - not intelligent, 
* a large population of these neurons or 
* features acting together can 
    - exhibit intelligent behavior.

<font color="red">It is important to emphasize that the number of neurons must be large</font>. 
* One of the key factors responsible for the improvement in neural networks' accuracy, and in the complexity of the tasks they can solve, between the 1980s and today is the dramatic increase in the size of the networks we use.

<img src="http://nbviewer.jupyter.org/github/songorithm/ML/blob/master/part1/study01/dml01/figures/fig1.6.png" width=600 />

<img src="http://nbviewer.jupyter.org/github/songorithm/ML/blob/master/part1/study01/dml01/figures/fig1.7.png" width=600 />

## 12.1.1 Fast CPU Implementations

Traditionally, neural networks were trained using the CPU of a single machine.
* Today, this approach is generally considered insufficient. 
* We now mostly use 
    - GPU computing or 
    - the CPUs of many machines networked together. 
* Before moving to these expensive setups, 
    - researchers worked hard to demonstrate that 
        - CPUs could not manage 
            - the high computational workload 
                - required by neural networks.
* A description of how to implement efficient numerical CPU code is beyond the scope of this book, but <font color="red">we emphasize here that careful implementation for specific CPU families can yield large improvements</font>.
    - For example, in 2011, the best CPUs available could run neural network workloads faster 
        - when using fixed-point arithmetic 
            - rather than floating-point arithmetic. 
    - By creating a carefully tuned fixed-point implementation,
        - Vanhoucke et al. (2011) 
        - obtained a 3× speedup over a strong floating-point system.
* Other strategies, 
    - besides choosing whether to use 
        - fixed or 
        - floating point,
    - include 
        - optimizing data structures 
            - to avoid cache misses 
        - and using vector instructions. 
* <font color="red">Many machine learning researchers neglect these implementation details</font>, but when the performance of an implementation restricts
    - the size of the model, 
    - the accuracy of the model suffers.
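
To make the fixed-point idea concrete, here is a minimal sketch of a dot product computed with integer arithmetic only. It is not the implementation of Vanhoucke et al.; the choice of 8 fractional bits and the helper names `to_fixed` and `fixed_dot` are assumptions made for illustration. Plain Python shows only the arithmetic, not the speedup, which comes from integer vector instructions on the CPU.

```python
FRACTION_BITS = 8            # assumed precision: values are integer multiples of 1/256
SCALE = 1 << FRACTION_BITS


def to_fixed(x):
    """Quantize a float to a fixed-point integer."""
    return int(round(x * SCALE))


def fixed_dot(weights_q, inputs_q):
    """Dot product using integer multiply-accumulate only.

    Each product carries a factor of SCALE**2, so the accumulated sum is
    divided by SCALE**2 once at the end to return to real units.
    """
    acc = 0
    for w, x in zip(weights_q, inputs_q):
        acc += w * x
    return acc / (SCALE * SCALE)


weights = [0.5, -1.25, 2.0]
inputs = [1.5, 0.75, -0.5]

exact = sum(w * x for w, x in zip(weights, inputs))
approx = fixed_dot([to_fixed(w) for w in weights],
                   [to_fixed(x) for x in inputs])
```

The example values happen to be exactly representable with 8 fractional bits, so `approx` matches `exact`; arbitrary values would incur a small quantization error.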

## 12.1.2 GPU Implementations

#### See also
* [2] Training ConvNets in practice : Data augmentation, transfer learning, Distributed training, CPU/GPU bottlenecks, Efficient convolutions - http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf

Most modern neural network implementations are based on graphics processing units. Graphics processing units (GPUs) are specialized hardware components that were originally developed for graphics applications.

* Neural networks usually involve large and numerous buffers of
    - parameters, 
    - activation values, and 
    - gradient values,
    - each of which must be completely updated 
        - during every step of training. 
* These buffers are large enough to fall outside 
    - the cache of a traditional desktop computer 
        - so the memory bandwidth of the system often 
            - becomes the rate limiting factor.
* GPUs offer a compelling advantage over CPUs 
    - due to their high memory bandwidth.
* Neural network training algorithms typically do not involve 
    - much branching or 
    - sophisticated control, 
    - so they are appropriate for GPU hardware. 
* Since neural networks can be divided into 
    - multiple individual “neurons” that 
        - can be processed independently 
            - from the other neurons in the same layer, 
    - neural networks easily benefit from 
        - <font color="red">the parallelism of GPU computing</font>.
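
The independence of neurons within a layer can be sketched as follows: each neuron's output depends only on the shared input vector and its own weights, so every neuron can be submitted as a separate task. The thread pool here is only a stand-in for GPU threads (CPython threads will not speed up this arithmetic), and the ReLU activation and function names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def neuron_output(weights, bias, inputs):
    """Weighted sum for one neuron, followed by a ReLU."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return max(0.0, total)


def layer_forward(weight_rows, biases, inputs, workers=4):
    """Evaluate every neuron of a layer as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(neuron_output, w, b, inputs)
                   for w, b in zip(weight_rows, biases)]
        # Results are collected in neuron order, regardless of finish order.
        return [f.result() for f in futures]


weight_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.5, 0.0]
outputs = layer_forward(weight_rows, biases, [2.0, 3.0])
```

No task reads another task's result, which is the property that lets a GPU assign neurons to many threads at once.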

Due to the difficulty of writing high performance GPU code, researchers should structure their workflow to avoid needing to write new GPU code in order to test new models or algorithms.
* Typically, one can do this by building a software library of 
    - high performance operations like 
        - convolution and 
        - matrix multiplication, 
    - then specifying models in terms of calls to this library of operations.
        - Pylearn2
        - Theano
        - cuda-convnet
        - TensorFlow
        - Torch

## 12.1.3 Large Scale Distributed Implementations

#### See also
* [3] Toward Scalable Deep Learning -  http://mlcenter.postech.ac.kr/files/attach/workshop_fall_2015/%EC%84%9C%EC%9A%B8%EB%8C%80%ED%95%99%EA%B5%90_%EC%9C%A4%EC%84%B1%EB%A1%9C_%EA%B5%90%EC%88%98_v1.pdf
* [4] Large Scale Distributed Deep Networks - http://www.slideshare.net/HiroyukiVincentYamaz/large-scale-distributed-deep-networks
* [5] Large Scale Deep Learning Jeff Dean - http://www.slideshare.net/hustwj/cikm-keynotenov2014
* [6] DeepDist : Lightning-Fast Deep Learning on Spark Via parallel stochastic gradient updates - http://deepdist.com/

In many cases, the computational resources available on a single machine are insufficient. We therefore want to distribute the workload of training and inference across many machines.
<img src="http://docplayer.net/docs-images/26/7315497/images/71-0.png" width=600 />
* <font color="red">data parallelism</font>
    - Distributing inference is simple, because each input example we want to process can be run by a separate machine. This is known as data parallelism.
    - Data parallelism during training is somewhat harder.
    - We can increase the size of the minibatch used for a single SGD step, but usually we get less than linear returns in terms of optimization performance.
    - It would be better to allow multiple machines to compute multiple gradient descent steps in parallel.
        - Unfortunately, the standard definition of gradient descent is as a completely sequential algorithm: the gradient at step t is a function of the parameters produced by step t-1.
        - This can be solved using asynchronous stochastic gradient descent.
    - asynchronous stochastic gradient descent
        - shared memory (one machine)
            - In this approach, several processor cores 
                - share the memory representing the parameters. 
            - Each core reads parameters without a lock, 
                - then computes a gradient, 
                - then increments the parameters 
                    - without a lock 
        - parameter server (multi-machines)
        <img src="http://image.slidesharecdn.com/presentationslides-160107181851/95/large-scale-distributed-deep-networks-32-638.jpg?cb=1452191027" width=600 />
* <font color="red">model parallelism</font>
    - It is also possible to get model parallelism, where multiple machines work together on a single data point, with each machine running a different part of the model. This is feasible for both inference and training.
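
The shared-memory variant of asynchronous SGD described above can be sketched with threads that read and update a shared parameter without any lock. This is a toy illustration under stated assumptions: the one-parameter model `y = w * x`, the target slope, the learning rate, and the step counts are all hypothetical, and Python threads stand in for processor cores.

```python
import random
import threading

TRUE_SLOPE = 3.0             # hypothetical target function: y = 3 * x
params = [0.0]               # shared parameters, read and written without a lock


def worker(seed, steps, lr):
    rng = random.Random(seed)             # each worker draws its own examples
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        y = TRUE_SLOPE * x
        w = params[0]                     # read parameters without a lock
        grad = 2.0 * (w * x - y) * x      # gradient of (w*x - y)**2 with respect to w
        params[0] -= lr * grad            # increment parameters without a lock


threads = [threading.Thread(target=worker, args=(seed, 2000, 0.05))
           for seed in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

learned_slope = params[0]
```

Because workers may overwrite each other's progress, an individual step can be based on slightly stale parameters; the trade-off is that no worker ever waits for another. A parameter server applies the same idea across machines, with the shared list replaced by a networked service.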

## 12.1.4 Model Compression

#### See also
* [7] Techniques for Efficient Implementation of Deep Neural Networks - http://www.slideshare.net/embeddedvision/techniques-for-efficient-implementation-of-deep-neural-networks-a-presentation-from-stanford
* [8] Compressing CNN for Mobile Device - http://mlcenter.postech.ac.kr/files/attach/workshop_fall_2015/%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90_%EA%B9%80%EC%9A%A9%EB%8D%95_%EB%B0%95%EC%82%AC.pdf

In many commercial applications, it is much more important that the time and memory cost of 
* <font color="red">running inference</font> in a machine learning model be low 
    - than that the time and memory cost of <font color="blue">training</font> be low.
* For applications that do not require personalization, 
    - it is possible to train a model once, 
    - then deploy it to be used by billions of users.

<font color="red">A key strategy for reducing the cost of inference is model compression</font>.
* The basic idea of model compression is 
    - to <font color="blue">replace</font> the original, <font color="blue">expensive model</font> with 
    - a <font color="green">smaller model</font> 
        - that requires less memory and runtime to store and evaluate.

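Model compression is applicable when the size of the original model is driven primarily by a need to prevent overfitting. Such large models learn some function f(x), but do so using many more parameters than are necessary for the task. Their size is necessary only due to the limited number of training examples.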
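
One way to carry out this replacement (stated here as an assumption, since the notes above do not spell out a procedure) is to label freshly sampled inputs with the large model and fit the small model to those labels. The sketch below uses a hypothetical "large" model, an averaged ensemble of 50 linear predictors, and compresses it into a single two-parameter linear model by ordinary least squares.

```python
import random

rng = random.Random(0)

# Hypothetical large model: 50 linear predictors (100 parameters) whose
# outputs are averaged.
ensemble = [(rng.gauss(2.0, 0.5), rng.gauss(-1.0, 0.5)) for _ in range(50)]


def teacher(x):
    return sum(a * x + b for a, b in ensemble) / len(ensemble)


# Label randomly sampled inputs with the large model. No original training
# labels are needed for this step.
xs = [rng.uniform(-5.0, 5.0) for _ in range(200)]
ys = [teacher(x) for x in xs]

# Fit the small model (2 parameters) by ordinary least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x


def student(x):
    return slope * x + intercept


# Largest disagreement between the two models on a few probe points.
max_gap = max(abs(student(x) - teacher(x)) for x in (-4.0, 0.0, 4.0))
```

In this toy case the average of linear models is itself linear, so the small model reproduces the large one almost exactly; with real networks the match is only approximate.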

## 12.1.5 Dynamic Structure

One strategy for <font color="red">accelerating data processing systems</font> in general is to build systems that have <font color="red">dynamic structure</font> in the <font color="blue">graph describing the computation needed to process an input</font>.
* Data processing systems can dynamically determine which subset of many neural networks should be run on a given input. 
* Individual neural networks can also exhibit dynamic structure internally by determining which subset of features (hidden units) to compute given information from the input. 
* This form of dynamic structure inside neural networks is sometimes called <font color="red">conditional computation</font>.

Dynamic structure of computations is a basic computer science principle applied generally throughout the software engineering discipline. 
* The simplest versions of dynamic structure applied to neural networks are based on determining which subset of some group of neural networks (or other machine learning models) should be applied to a particular input.

A venerable strategy for accelerating inference in a classifier is to use a <font color="red">cascade of classifiers</font>. 
* The cascade strategy may be applied when the goal is to detect the presence of a rare object (or event).
* Viola and Jones (2001) used a cascade of boosted decision trees to implement a fast and robust face detector suitable for use in handheld digital cameras.
    <img src="http://image.slidesharecdn.com/chernodubkharkivaiclubv03post-150701123201-lva1-app6891/95/details-of-lazy-deep-learning-for-images-recognition-in-zz-photo-app-41-638.jpg?cb=1435754072" width=600 />
* Another version of cascades uses the earlier models to implement a sort of <font color="red">hard attention mechanism</font>.
    <img src="http://cdn-ak.f.st-hatena.com/images/fotolife/P/PDFangeltop1/20160205/20160205113146.png" width=600 />
    <img src="https://qph.is.quoracdn.net/main-qimg-148b032baba54d1db01e3b8c39168f3a?convert_to_webp=true" width=600 />
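
The cascade strategy can be sketched as a chain of stages in which an input is rejected as soon as any stage says "no", so the costly stage runs only on the few inputs that survive the cheap one. The two threshold "classifiers" and the example inputs below are hypothetical stand-ins for real models.

```python
calls = {"expensive": 0}     # counts how often the costly stage actually runs


def cheap_stage(x):
    """Low-cost, high-recall filter: rejects only clearly negative inputs."""
    return x > 0.5


def expensive_stage(x):
    """High-precision check, standing in for a costly model."""
    calls["expensive"] += 1
    return x > 0.9


def cascade_detect(x):
    # `and` short-circuits: the expensive stage is skipped after a rejection.
    return cheap_stage(x) and expensive_stage(x)


inputs = [0.05, 0.10, 0.20, 0.30, 0.60, 0.95, 0.40, 0.15]   # the target is rare
detections = [x for x in inputs if cascade_detect(x)]
```

Here only two of the eight inputs reach the expensive stage, which is where the speedup comes from when the object being detected is rare.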

<font color="red">Decision trees</font> themselves are an example of dynamic structure, because each node in the tree determines which of its subtrees should be evaluated for each input.

<img src="http://www.time-management-guide.com/images/decision-tree.gif" width=600 />

In the same spirit, one can use a neural network, called the gater, to select which one out of several expert networks will be used to compute the output, given the current input.
* The first version of this idea is called the <font color="red">mixture of experts</font> (Nowlan, 1990; Jacobs et al., 1991), in which the gater outputs a set of probabilities or weights (obtained via a softmax nonlinearity), one per expert, and the final output is obtained by the weighted combination of the outputs of the experts. 
* In that case, the use of the gater does not offer a reduction in computational cost, but if a single expert is chosen by the gater for each example, we obtain the <font color="red">hard mixture of experts</font> (Collobert et al., 2001, 2002), which can considerably accelerate training and inference time.
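
The difference between the soft and hard variants can be sketched as follows. The three experts and the gater's scoring rule are hypothetical and chosen only so the arithmetic is easy to follow; in practice both would be trained networks.

```python
import math


def softmax(scores):
    shift = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical experts and gater, for illustration only.
experts = [lambda x: 2.0 * x, lambda x: x + 10.0, lambda x: -x]


def gater_scores(x):
    return [x, 0.0, -x]


def soft_mixture(x):
    """Weighted combination: every expert is evaluated."""
    weights = softmax(gater_scores(x))
    return sum(w * expert(x) for w, expert in zip(weights, experts))


def hard_mixture(x):
    """Only the expert with the largest gater weight is evaluated."""
    weights = softmax(gater_scores(x))
    best = max(range(len(experts)), key=lambda i: weights[i])
    return experts[best](x)


soft_out = soft_mixture(0.0)   # equal weights, so the mean of the three expert outputs
hard_out = hard_mixture(3.0)   # the gater favors the first expert
```

`soft_mixture` calls all three experts on every input, while `hard_mixture` calls exactly one, which is the source of the computational saving.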

<img src="http://www.frontiersin.org/files/Articles/119429/fnhum-08-00971-HTML/image_m/fnhum-08-00971-g004.jpg" width=600 />

Another kind of dynamic structure is a switch, where a hidden unit can receive input from different units depending on the context. This <font color="red">dynamic routing</font> approach can be interpreted as an attention mechanism.

<img src="http://mlcenter.postech.ac.kr/files/attach/images/436/279/374/de62e0e6f848f9b5ded7c789c35483d4.png" width=600 />

<font color="red">One major obstacle to using dynamically structured systems is the decreased degree of parallelism that results from the system following different code branches for different inputs</font>.

## 12.1.6 Specialized Hardware Implementations of Deep Networks

# 12.2 Computer Vision
* 12.2.1 Preprocessing
    - 12.2.1.1 Contrast Normalization
    - 12.2.1.2 Dataset Augmentation

#### See also
* [11] ImageNet - http://image-net.org/
* [12] ILSVRC 2015 - http://image-net.org/challenges/ilsvrc+mscoco2015
* [13] ILSVRC 2014 - http://image-net.org/challenges/LSVRC/2014/eccv2014

## 12.2.1 Preprocessing

#### See also
* [14] Intro to Computer Vision, historical context - http://cs231n.stanford.edu/slides/winter1516_lecture1.pdf
* [2] Training ConvNets in practice : Data augmentation, transfer learning, Distributed training, CPU/GPU bottlenecks, Efficient convolutions - http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf

Computer vision usually requires relatively little sophisticated preprocessing.
* pixel value range
    - The images should be standardized so that their pixels all lie in the same, reasonable range, like [0,1] or [-1, 1]. 
    - Mixing images that lie in [0,1] with images that lie in [0, 255] will usually result in failure.
    - <font color="red">Formatting images to have the same scale is the only kind of preprocessing that is strictly necessary</font>.
* image size 
    - Many computer vision architectures require images of a standard size, so images must be cropped or scaled to fit that size.
    - <font color="blue">However, even this rescaling is not always strictly necessary</font>.
        - Some convolutional models accept variably-sized inputs and dynamically adjust the size of their pooling regions to keep the output size constant.
        - Other convolutional models have variable-sized output that automatically scales in size with the input, such as models that denoise or label each pixel in an image.
* Dataset augmentation
    - Dataset augmentation may be seen as a way of <font color="red">preprocessing the training set only</font>.
    - Dataset augmentation is an excellent way to <font color="red">reduce the generalization error</font> of most computer vision models.
    - A related idea applicable at test time is to show the model many different versions of the same input 
        - (for example, the same image cropped at slightly different locations) 
    - and have the different instantiations of the model vote to determine the output. 
    - This latter idea can be interpreted as an <font color="red">ensemble approach</font>, 
    - and helps to reduce generalization error.
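
Two of the points above, putting pixels on a common scale and voting over several crops at test time, can be sketched together. The brightness-threshold `toy_classifier` is a hypothetical stand-in for a trained model, and the 3x3 image and helper names are assumptions for illustration.

```python
from collections import Counter


def scale_to_unit(image):
    """Map 8-bit pixel values in [0, 255] to floats in [0, 1]."""
    return [[p / 255.0 for p in row] for row in image]


def crops(image, size):
    """All size x size crops of a 2-D image stored as a list of rows."""
    h, w = len(image), len(image[0])
    return [[row[c:c + size] for row in image[r:r + size]]
            for r in range(h - size + 1)
            for c in range(w - size + 1)]


def toy_classifier(patch):
    """Hypothetical model: thresholds the mean brightness of the patch."""
    pixels = [p for row in patch for p in row]
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"


def predict_with_voting(image, size):
    """Classify every crop and return the majority label."""
    votes = Counter(toy_classifier(c) for c in crops(image, size))
    return votes.most_common(1)[0][0]


raw = [[255, 255, 255],
       [255, 255, 0],
       [255, 0, 0]]
label = predict_with_voting(scale_to_unit(raw), size=2)
```

Three of the four crops are classified as bright and one as dark, so the vote overrules the single dissenting crop.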

### 12.2.1.1 Contrast Normalization

<img src="figures/cap12.1.png" width=600 />

<img src="figures/cap12.2.png" width=600 />

<img src="figures/cap12.3.png" width=600 />

<img src="figures/cap12.4.png" width=600 />

### 12.2.1.2 Dataset Augmentation

# 12.3 Speech Recognition

<img src="figures/cap12.5.png" width=600 />

# 12.4 Natural Language Processing
* 12.4.1 n-grams
* 12.4.2 Neural Language Models
* 12.4.3 High-Dimensional Outputs
* 12.4.4 Combining Neural Language Models with n-grams
* 12.4.5 Neural Machine Translation
* 12.4.6 Historical Perspective

## 12.4.1 n-grams

<img src="figures/cap12.6.png" width=600 />

<img src="figures/cap12.7.png" width=600 />

<img src="figures/cap12.8.png" width=600 />

## 12.4.2 Neural Language Models

<img src="figures/cap12.9.png" width=600 />

## 12.4.3 High-Dimensional Outputs
* 12.4.3.1 Use of a Short List
* 12.4.3.2 Hierarchical Softmax
* 12.4.3.3 Importance Sampling
* 12.4.3.4 Noise-Contrastive Estimation and Ranking Loss

<img src="figures/cap12.10.png" width=600 />

### 12.4.3.1 Use of a Short List

<img src="figures/cap12.11.png" width=600 />

### 12.4.3.2 Hierarchical Softmax

<img src="figures/cap12.12.png" width=600 />

### 12.4.3.3 Importance Sampling

<img src="figures/cap12.13.png" width=600 />

<img src="figures/cap12.14.png" width=600 />

<img src="figures/cap12.15.png" width=600 />

### 12.4.3.4 Noise-Contrastive Estimation and Ranking Loss

<img src="figures/cap12.16.png" width=600 />

## 12.4.4 Combining Neural Language Models with n-grams

## 12.4.5 Neural Machine Translation
* 12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data

<img src="figures/cap12.17.png" width=600 />

### 12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data

<img src="figures/cap12.18.png" width=600 />

## 12.4.6 Historical Perspective

# 12.5 Other Applications
* 12.5.1 Recommender Systems
    - 12.5.1.1 Exploration Versus Exploitation
* 12.5.2 Knowledge Representation, Reasoning and Question Answering

## 12.5.1 Recommender Systems

<img src="figures/cap12.19.png" width=600 />

### 12.5.1.1 Exploration Versus Exploitation

## 12.5.2 Knowledge Representation, Reasoning and Question Answering
* 12.5.2.1 Knowledge, Relations and Question Answering

### 12.5.2.1 Knowledge, Relations and Question Answering

<img src="figures/cap12.20.png" width=600 />

# References

* [1] deeplearning book - http://www.deeplearningbook.org
* [2] Training ConvNets in practice : Data augmentation, transfer learning, Distributed training, CPU/GPU bottlenecks, Efficient convolutions - http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
* [3] Toward Scalable Deep Learning -  http://mlcenter.postech.ac.kr/files/attach/workshop_fall_2015/%EC%84%9C%EC%9A%B8%EB%8C%80%ED%95%99%EA%B5%90_%EC%9C%A4%EC%84%B1%EB%A1%9C_%EA%B5%90%EC%88%98_v1.pdf
* [4] Large Scale Distributed Deep Networks - http://www.slideshare.net/HiroyukiVincentYamaz/large-scale-distributed-deep-networks
* [5] Large Scale Deep Learning Jeff Dean - http://www.slideshare.net/hustwj/cikm-keynotenov2014
* [6] DeepDist : Lightning-Fast Deep Learning on Spark Via parallel stochastic gradient updates - http://deepdist.com/
* [7] Techniques for Efficient Implementation of Deep Neural Networks - http://www.slideshare.net/embeddedvision/techniques-for-efficient-implementation-of-deep-neural-networks-a-presentation-from-stanford
* [8] Compressing CNN for Mobile Device - http://mlcenter.postech.ac.kr/files/attach/workshop_fall_2015/%EC%82%BC%EC%84%B1%EC%A0%84%EC%9E%90_%EA%B9%80%EC%9A%A9%EB%8D%95_%EB%B0%95%EC%82%AC.pdf
* [9] ATTENTION MECHANISM - https://blog.heuritech.com/2016/01/20/attention-mechanism/
* [10] Dynamic Scheduler for Scaling up Deep Learning - http://mlcenter.postech.ac.kr/deep_learning
* [11] ImageNet - http://image-net.org/
* [12] ILSVRC 2015 - http://image-net.org/challenges/ilsvrc+mscoco2015
* [13] ILSVRC 2014 - http://image-net.org/challenges/LSVRC/2014/eccv2014
* [14] Intro to Computer Vision, historical context - http://cs231n.stanford.edu/slides/winter1516_lecture1.pdf