# Chapter 12. Applications

* Psygrammer (싸이그래머) / QGM: Part 4 - Deep Learning [1]
* 김무성

# Contents
* 12.1 Large Scale Deep Learning
    - 12.1.1 Fast CPU Implementations
    - 12.1.2 GPU Implementations
    - 12.1.3 Large Scale Distributed Implementations
    - 12.1.4 Model Compression
    - 12.1.5 Dynamic Structure
    - 12.1.6 Specialized Hardware Implementations of Deep Networks
* 12.2 Computer Vision
    - 12.2.1 Preprocessing
        - 12.2.1.1 Contrast Normalization
        - 12.2.1.2 Dataset Augmentation
* 12.3 Speech Recognition
* 12.4 Natural Language Processing
    - 12.4.1 n-grams
    - 12.4.2 Neural Language Models
    - 12.4.3 High-Dimensional Outputs
        - 12.4.3.1 Use of a Short List
        - 12.4.3.2 Hierarchical Softmax
        - 12.4.3.3 Importance Sampling
        - 12.4.3.4 Noise-Contrastive Estimation and Ranking Loss
    - 12.4.4 Combining Neural Language Models with n-grams
    - 12.4.5 Neural Machine Translation
        - 12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data
    - 12.4.6 Historical Perspective
* 12.5 Other Applications
    - 12.5.1 Recommender Systems
        - 12.5.1.1 Exploration Versus Exploitation
    - 12.5.2 Knowledge Representation, Reasoning and Question Answering
        - 12.5.2.1 Knowledge, Relations and Question Answering

In this chapter, we describe 
* how to use deep learning to solve applications in 
    - computer vision, 
    - speech recognition, 
    - natural language processing, and 
    - other application areas of commercial interest. 
* We begin by discussing 
    - the large scale neural network implementations 
        - required for most serious AI applications. 
* Next, we review several specific application areas that 
    - deep learning has been used to solve. 
    - While one goal of deep learning is to design algorithms that are capable of solving a broad variety of tasks, so far some degree of specialization is needed.

# 12.1 Large Scale Deep Learning
* 12.1.1 Fast CPU Implementations
* 12.1.2 GPU Implementations
* 12.1.3 Large Scale Distributed Implementations
* 12.1.4 Model Compression
* 12.1.5 Dynamic Structure
* 12.1.6 Specialized Hardware Implementations of Deep Networks

#### Reference
* [2] Training ConvNets in practice : Data augmentation, transfer learning, Distributed training, CPU/GPU bottlenecks, Efficient convolutions - http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf

Deep learning is based on the philosophy of connectionism: 
* while an individual biological neuron or 
* an individual feature in a machine learning model is
    - not intelligent, 
* a large population of these neurons or 
* features acting together can 
    - exhibit intelligent behavior.

<font color="red">It truly is important to emphasize the fact that the number of neurons must be large</font>. 
* One of the key factors responsible for the improvement in neural networks' accuracy, and in the complexity of the tasks they can solve, between the 1980s and today is the dramatic increase in the size of the networks we use.

<img src="http://nbviewer.jupyter.org/github/songorithm/ML/blob/master/part1/study01/dml01/figures/fig1.6.png" width=600 />

<img src="http://nbviewer.jupyter.org/github/songorithm/ML/blob/master/part1/study01/dml01/figures/fig1.7.png" width=600 />

## 12.1.1 Fast CPU Implementations

* Traditionally, neural networks were trained using the CPU of a single machine.
    - Today, this approach is generally considered insufficient. 
* We now mostly use 
    - GPU computing or 
    - the CPUs of many machines networked together. 
* Before moving to these expensive setups, 
    - researchers worked hard to demonstrate that 
        - CPUs could not manage 
            - the high computational workload 
                - required by neural networks.
* A description of how to implement efficient numerical CPU code is beyond the scope of this book, but <font color="red">we emphasize here that careful implementation for specific CPU families can yield large improvements</font>.
    - For example, in 2011, the best CPUs available could run neural network workloads faster 
        - when using fixed-point arithmetic 
            - rather than floating-point arithmetic. 
    - By creating a carefully tuned fixed-point implementation,
        - Vanhoucke et al. (2011) 
        - obtained a 3× speedup over a strong floating-point system.
* Other strategies, 
    - besides choosing whether to use 
        - fixed or 
        - floating point,
    - include 
        - optimizing data structures 
            - to avoid cache misses 
        - and using vector instructions. 
* <font color="red">Many machine learning researchers neglect these implementation details</font>, but when the performance of an implementation restricts
    - the size of the model, 
    - the accuracy of the model suffers.
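
The fixed-point idea above can be illustrated with a minimal sketch. This is not the implementation of Vanhoucke et al. (2011); the helper names `quantize` and `fixed_point_dot` and the choice of 8 fractional bits are assumptions made only for illustration. Pure Python integers give no speedup by themselves; on real hardware the gain comes from integer vector instructions.

```python
def quantize(values, frac_bits=8):
    """Convert floats to signed fixed-point integers with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return [int(round(v * scale)) for v in values]


def fixed_point_dot(q_weights, q_inputs, frac_bits=8):
    """Dot product that uses only integer multiply-accumulate.

    Each product carries 2 * frac_bits fractional bits, so the accumulated
    sum is rescaled once at the end instead of after every multiplication.
    """
    acc = 0
    for w, x in zip(q_weights, q_inputs):
        acc += w * x  # integer multiply-accumulate
    return acc / float(1 << (2 * frac_bits))


# Illustrative values (hypothetical, chosen to be exactly representable).
weights = [0.5, -0.25, 0.125]
inputs = [1.0, 2.0, -4.0]

float_result = sum(w * x for w, x in zip(weights, inputs))
fixed_result = fixed_point_dot(quantize(weights), quantize(inputs))
```

With these values the two results agree exactly; for arbitrary floats the fixed-point result differs by a small rounding error that depends on `frac_bits`.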

## 12.1.2 GPU Implementations

* Most modern neural network implementations are based on graphics processing units. Graphics processing units (GPUs) are specialized hardware components that were originally developed for graphics applications.

* Neural networks usually involve large and numerous buffers of
    - parameters, 
    - activation values, and 
    - gradient values,
    - each of which must be completely updated 
        - during every step of training. 
* These buffers are large enough to fall outside 
    - the cache of a traditional desktop computer 
        - so the memory bandwidth of the system often 
            - becomes the rate-limiting factor.
* GPUs offer a compelling advantage over CPUs 
    - due to their high memory bandwidth.
* Neural network training algorithms typically do not involve 
    - much branching or 
    - sophisticated control, 
    - so they are appropriate for GPU hardware. 
* Since neural networks can be divided into 
    - multiple individual “neurons” that 
        - can be processed independently 
            - from the other neurons in the same layer, 
    - neural networks easily benefit from 
        - <font color="red">the parallelism of GPU computing</font>.
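
The independence of neurons within a layer can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, ReLU is an arbitrary choice of activation, and a thread pool stands in for GPU threads (CPython threads do not actually speed up this arithmetic). The point is that each neuron's output depends only on the shared inputs and its own weights, so the neurons can be evaluated as separate tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def neuron_output(weights_row, bias, inputs):
    """One neuron: weighted sum of the inputs plus a bias, followed by ReLU."""
    pre_activation = sum(w * x for w, x in zip(weights_row, inputs)) + bias
    return max(0.0, pre_activation)


def layer_sequential(weights, biases, inputs):
    """Evaluate the neurons of one layer one after another."""
    return [neuron_output(row, b, inputs) for row, b in zip(weights, biases)]


def layer_parallel(weights, biases, inputs, workers=4):
    """Same layer, but each neuron is submitted as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda rb: neuron_output(rb[0], rb[1], inputs),
                             zip(weights, biases)))


# Illustrative 3-neuron layer with 2 inputs (hypothetical values).
weights = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
biases = [0.0, 1.0, 0.5]
inputs = [2.0, 3.0]

sequential = layer_sequential(weights, biases, inputs)
parallel = layer_parallel(weights, biases, inputs)
```

Both versions produce the same list because no neuron reads another neuron's result.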

Due to the difficulty of writing high performance GPU code, researchers should structure their workflow to avoid needing to write new GPU code in order to test new models or algorithms.
* Typically, one can do this by building a software library of 
    - high performance operations like 
        - convolution and 
        - matrix multiplication, 
    - then specifying models in terms of calls to this library of operations.
        - Pylearn2
        - Theano
        - cuda-convnet
        - TensorFlow
        - Torch

## 12.1.3 Large Scale Distributed Implementations

## 12.1.4 Model Compression

## 12.1.5 Dynamic Structure

## 12.1.6 Specialized Hardware Implementations of Deep Networks

# 12.2 Computer Vision
* 12.2.1 Preprocessing
    - 12.2.1.1 Contrast Normalization
    - 12.2.1.2 Dataset Augmentation

## 12.2.1 Preprocessing

### 12.2.1.1 Contrast Normalization

<img src="figures/cap12.1.png" width=600 />

<img src="figures/cap12.2.png" width=600 />

<img src="figures/cap12.3.png" width=600 />

<img src="figures/cap12.4.png" width=600 />

### 12.2.1.2 Dataset Augmentation

# 12.3 Speech Recognition

<img src="figures/cap12.5.png" width=600 />

# 12.4 Natural Language Processing
* 12.4.1 n-grams
* 12.4.2 Neural Language Models
* 12.4.3 High-Dimensional Outputs
* 12.4.4 Combining Neural Language Models with n-grams
* 12.4.5 Neural Machine Translation
* 12.4.6 Historical Perspective

## 12.4.1 n-grams

<img src="figures/cap12.6.png" width=600 />

<img src="figures/cap12.7.png" width=600 />

<img src="figures/cap12.8.png" width=600 />

## 12.4.2 Neural Language Models

<img src="figures/cap12.9.png" width=600 />

## 12.4.3 High-Dimensional Outputs
* 12.4.3.1 Use of a Short List
* 12.4.3.2 Hierarchical Softmax
* 12.4.3.3 Importance Sampling
* 12.4.3.4 Noise-Contrastive Estimation and Ranking Loss

<img src="figures/cap12.10.png" width=600 />

### 12.4.3.1 Use of a Short List

<img src="figures/cap12.11.png" width=600 />

### 12.4.3.2 Hierarchical Softmax

<img src="figures/cap12.12.png" width=600 />

### 12.4.3.3 Importance Sampling

<img src="figures/cap12.13.png" width=600 />

<img src="figures/cap12.14.png" width=600 />

<img src="figures/cap12.15.png" width=600 />

### 12.4.3.4 Noise-Contrastive Estimation and Ranking Loss

<img src="figures/cap12.16.png" width=600 />

## 12.4.4 Combining Neural Language Models with n-grams

## 12.4.5 Neural Machine Translation
* 12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data

<img src="figures/cap12.17.png" width=600 />

### 12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data

<img src="figures/cap12.18.png" width=600 />

## 12.4.6 Historical Perspective

# 12.5 Other Applications
* 12.5.1 Recommender Systems
    - 12.5.1.1 Exploration Versus Exploitation
* 12.5.2 Knowledge Representation, Reasoning and Question Answering

## 12.5.1 Recommender Systems

<img src="figures/cap12.19.png" width=600 />

### 12.5.1.1 Exploration Versus Exploitation

## 12.5.2 Knowledge Representation, Reasoning and Question Answering
* 12.5.2.1 Knowledge, Relations and Question Answering

### 12.5.2.1 Knowledge, Relations and Question Answering

<img src="figures/cap12.20.png" width=600 />

# References

* [1] Deep Learning book - http://www.deeplearningbook.org
* [2] Training ConvNets in practice: Data augmentation, transfer learning, Distributed training, CPU/GPU bottlenecks, Efficient convolutions - http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
* [3] Toward Scalable Deep Learning - http://mlcenter.postech.ac.kr/files/attach/workshop_fall_2015/%EC%84%9C%EC%9A%B8%EB%8C%80%ED%95%99%EA%B5%90_%EC%9C%A4%EC%84%B1%EB%A1%9C_%EA%B5%90%EC%88%98_v1.pdf