# 2. Before we begin: the mathematical building blocks of neural networks

* About Python: Deep Learning with Keras [1, 2]
* 김무성

# Contents
* 2.1 A first look at a neural network
* 2.2 Data representations for neural networks
* 2.3 The gears of neural networks: tensor operations
    - 2.3.1 Element-wise operations
    - 2.3.2 Broadcasting
    - 2.3.3 Tensor dot
    - 2.3.4 Tensor reshaping
    - <font color="red">2.3.5 Geometric interpretation of tensor operations</font>
    - <font color="red">2.3.6 A geometric interpretation of deep learning</font>
* <font color="red">2.4 The engine of neural networks: gradient-based optimization </font>
* <font color="red">2.5 Looking back at our first example</font>

## 2.3.5 Geometric interpretation of tensor operations

In [None]:
# A 
# Let's try it

<img src="figures/cap01.png" width=600 />

In [None]:
# B = [1, 0.25]
# Let's add A and B

<img src="figures/cap02.png" width=600 />

* In general, elementary geometric operations such as affine transformations, rotations, scaling, and so on can be expressed as tensor operations. 
* For instance, a rotation of a 2D vector by an angle theta can be achieved via a dot product with a 2 × 2 matrix R = [u, v], 
    - where u and v are both vectors of the plane: 
        - u = [cos(theta), sin(theta)] and 
        - v = [-sin(theta), cos(theta)].
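
The rotation described above can be tried out with NumPy. This is only a minimal sketch: the helper name `rotation_matrix` and the 90-degree test angle are assumptions made for illustration, and `u` and `v` are taken as the columns of R.

```python
import numpy as np

def rotation_matrix(theta):
    """Build the 2 x 2 matrix R = [u, v], with u and v as its columns."""
    u = np.array([np.cos(theta), np.sin(theta)])
    v = np.array([-np.sin(theta), np.cos(theta)])
    return np.column_stack([u, v])

theta = np.pi / 2             # rotate by 90 degrees counterclockwise
R = rotation_matrix(theta)
x = np.array([1.0, 0.0])      # a vector pointing along the first axis
rotated = np.dot(R, x)        # approximately [0, 1]: now points along the second axis
```

A rotation keeps the length of the vector unchanged, so `np.linalg.norm(rotated)` should match `np.linalg.norm(x)` up to floating-point error.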

<img src="http://www.sharetechnote.com/image/EngMath_Matrix_Affin_Rotate.PNG" width=600 />

#### References

>* [6] Affine Mapping/Affine Transformation - http://www.sharetechnote.com/html/EngMath_Matrix_AffineMapping.html

## 2.3.6 A geometric interpretation of deep learning

You just learned that <font color="blue">neural networks consist entirely of chains of tensor operations</font> and that all of these <font color="green">tensor operations are just geometric transformations of the input data</font>. It follows that <font color="red">you can interpret a neural network as a very complex geometric transformation in a high-dimensional space, implemented via a long series of simple steps</font>.

<img src="figures/cap03.png" width=600 />

<img src="https://raw.githubusercontent.com/psygrammer/about_python_dl/master/keras/01_intro/figures/cap06.png" width=600 />

# 2.4 The engine of neural networks: gradient-based optimization 
* 2.4.1 What’s a derivative? 
* 2.4.2 Derivative of a tensor operation: the gradient 
* 2.4.3 Stochastic gradient descent 
* 2.4.4 Chaining derivatives: the Backpropagation algorithm

#### References

>* [3] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / Lecture 2: Image Classification Pipeline - http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture2.pdf
>* [4] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / Lecture 3: Loss Functions and Optimization - http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture3.pdf
>* [7] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / optimization notes - http://cs231n.github.io/optimization-1/

<img src="https://raw.githubusercontent.com/psygrammer/about_python_dl/master/keras/01_intro/figures/cap05.png" width=600 />

<img src="https://raw.githubusercontent.com/psygrammer/about_python_dl/master/keras/01_intro/figures/cap09.png" width=600 />

```
output = relu(dot(W, input) + b)
```
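
The layer formula above can be written out in NumPy. This is a rough sketch rather than how Keras implements a `Dense` layer: the function names, the shapes (4 inputs, 3 units) and the random initial values are all assumptions made for illustration.

```python
import numpy as np

def relu(x):
    """Element-wise max(x, 0)."""
    return np.maximum(x, 0.0)

def dense_forward(W, x, b):
    """Compute relu(dot(W, x) + b) for one input vector x."""
    return relu(np.dot(W, x) + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)) * 0.1   # weights: 3 output units, 4 input features
b = np.zeros(3)                     # one bias per output unit
x = rng.normal(size=4)              # a single input vector

output = dense_forward(W, x, b)     # shape (3,), every entry is >= 0
```

Training consists of adjusting `W` and `b` so that the loss computed from `output` goes down, which is what the rest of this section is about.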

##### A first, very bad solution: random search
* [7] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / optimization notes - http://cs231n.github.io/optimization-1/

```python
import numpy as np

# assume X_train is the data where each column is an example (e.g. 3073 x 50,000)
# assume Y_train are the labels (e.g. 1D array of 50,000)
# assume the function L evaluates the loss function

bestloss = float("inf") # start at positive infinity so the first loss always improves on it
for num in range(1000):
  W = np.random.randn(10, 3073) * 0.0001 # generate random parameters
  loss = L(X_train, Y_train, W) # get the loss over the entire training set
  if loss < bestloss: # keep track of the best solution
    bestloss = loss
    bestW = W
  print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))

# prints:
# in attempt 0 the loss was 9.401632, best 9.401632
# in attempt 1 the loss was 8.959668, best 8.959668
# in attempt 2 the loss was 9.044034, best 8.959668
# in attempt 3 the loss was 9.278948, best 8.959668
# in attempt 4 the loss was 8.857370, best 8.857370
# in attempt 5 the loss was 8.943151, best 8.857370
# in attempt 6 the loss was 8.605604, best 8.605604
# ... (truncated: continues for 1000 lines)
```

A much better approach is to take advantage of the fact that all operations used in the network are differentiable, and compute the gradient of the loss with regard to the network’s coefficients. You can then move the coefficients in the opposite direction from the gradient, thus decreasing the loss.
* <font color="blue">If you already know what</font> <font color="red">differentiable</font> <font color="blue">means and what a</font> <font color="red">gradient</font> <font color="blue">is, you can skip to section 2.4.3. Otherwise, the following two sections will help you understand these concepts.</font>

## 2.4.1 What’s a derivative?

Consider a continuous, smooth function f(x) = y, mapping a real number x to a new real number y. Let’s say you increase x by a small amount epsilon_x: this results in a small epsilon_y change to y:
```
f(x + epsilon_x) = y + epsilon_y
```

When epsilon_x is small enough, around a certain point p, it’s possible to approximate f as a linear function of slope a, so that epsilon_y becomes a * epsilon_x:

```
f(x + epsilon_x) = y + a * epsilon_x
```

* Obviously, this linear approximation is valid only when x is close enough to p.
* The slope a is called the derivative of f at p.

<img src="figures/cap04.png" width=600 />

For every differentiable function f(x) (differentiable means “can be derived”: for example, smooth, continuous functions can be derived), there exists a derivative function f'(x) that maps values of x to the slope of the local linear approximation of f at those points.
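
To make the local linear approximation concrete, here is a small sketch. The example function f(x) = x ** 2, the point p = 3.0, and the helper name `numerical_derivative` are all assumptions chosen for illustration. The slope is estimated with a central finite difference.

```python
def f(x):
    return x ** 2                      # example function; its exact derivative is 2 * x

def numerical_derivative(f, p, eps=1e-6):
    """Estimate the slope of f at p with a central finite difference."""
    return (f(p + eps) - f(p - eps)) / (2 * eps)

p = 3.0
a = numerical_derivative(f, p)         # slope of f at p, close to 2 * p

epsilon_x = 0.01
exact = f(p + epsilon_x)               # true value after a small step
linear = f(p) + a * epsilon_x          # value predicted by the linear approximation
```

For a small `epsilon_x` the values `exact` and `linear` should be very close. The gap grows as `epsilon_x` gets larger, which is why the approximation is only valid near p.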

## 2.4.2 Derivative of a tensor operation: the gradient

<font color="red">A gradient is the derivative of a tensor operation</font>. It’s the generalization of the concept of derivatives to functions of multidimensional inputs: that is, to functions that take tensors as inputs.
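
As a sketch of what this means in practice, take a tiny linear model with a squared-error loss. The loss function, the shapes, and the names `loss` and `analytic_gradient` are assumptions made for illustration. The gradient of the loss with respect to `W` is a tensor with the same shape as `W`, and each of its entries is an ordinary partial derivative.

```python
import numpy as np

def loss(W, x, y):
    """Squared error of the linear prediction dot(W, x) against the target y."""
    residual = np.dot(W, x) - y
    return np.sum(residual ** 2)

def analytic_gradient(W, x, y):
    """Gradient of the loss with respect to W: 2 * outer(residual, x)."""
    residual = np.dot(W, x) - y
    return 2.0 * np.outer(residual, x)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x = rng.normal(size=3)
y = rng.normal(size=2)

grad = analytic_gradient(W, x, y)      # same shape as W: (2, 3)

# Check one entry against a central finite difference in that single coordinate.
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += h
W_minus[0, 1] -= h
numeric_entry = (loss(W_plus, x, y) - loss(W_minus, x, y)) / (2 * h)
```

`numeric_entry` should agree with `grad[0, 1]` up to floating-point error.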

<img src="http://cfile29.uf.tistory.com/image/99E6363359D86A8805C292" />

#### Computing the gradient

* [7] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / optimization notes - http://cs231n.github.io/optimization-1/

```python
import numpy as np

def eval_numerical_gradient(f, x):
  """ 
  a naive implementation of numerical gradient of f at x 
  - f should be a function that takes a single argument
  - x is the point (numpy array of floats) to evaluate the gradient at;
    it is perturbed in place and restored before returning
  """ 

  fx = f(x) # evaluate function value at original point
  grad = np.zeros(x.shape)
  h = 0.00001

  # iterate over all indexes in x
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

    # evaluate function at x+h
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h # increment by h
    fxh = f(x) # evaluate f(x + h)
    x[ix] = old_value # restore to previous value (very important!)

    # compute the partial derivative
    grad[ix] = (fxh - fx) / h # the slope
    it.iternext() # step to next dimension

  return grad
```

## 2.4.3 Stochastic gradient descent

* Given a differentiable function, it’s theoretically possible to find its minimum analytically: it’s known that a function’s minimum is a point where the derivative is 0, so all you have to do is find all the points where the derivative goes to 0 and check for which of these points the function has the lowest value.
* <font color="red">The term stochastic refers to the fact that each batch of data is drawn at random (stochastic is a scientific synonym of random)</font>.

#### Vanilla Gradient Descent
* [7] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / optimization notes - http://cs231n.github.io/optimization-1/

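```python
# pseudocode: evaluate_gradient, loss_fun, data, weights and step_size are assumed to be defined elsewhere
while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights -= step_size * weights_grad # perform parameter update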
```

#### Vanilla Minibatch Gradient Descent

* [7] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / optimization notes - http://cs231n.github.io/optimization-1/

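```python
# pseudocode: sample_training_data, evaluate_gradient, loss_fun, data, weights and step_size are assumed to be defined elsewhere
while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights -= step_size * weights_grad # perform parameter update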
```
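
The minibatch loop above can be turned into a runnable sketch on a toy problem. Everything specific here is an assumption made for illustration: the synthetic data (y = 3x + 1 plus a little noise), the mean-squared-error loss, the step size, the batch size, and the fixed number of steps that replaces `while True`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (assumed for this sketch): y = 3 * x + 1 plus small noise.
X = rng.uniform(-1.0, 1.0, size=1000)
y = 3.0 * X + 1.0 + 0.01 * rng.normal(size=1000)

w, b = 0.0, 0.0        # parameters to learn
step_size = 0.1
batch_size = 32

for step in range(2000):
    idx = rng.integers(0, len(X), size=batch_size)   # draw a random batch of indices
    xb, yb = X[idx], y[idx]
    error = w * xb + b - yb                          # prediction error on the batch
    grad_w = 2.0 * np.mean(error * xb)               # d(mean squared error) / dw
    grad_b = 2.0 * np.mean(error)                    # d(mean squared error) / db
    w -= step_size * grad_w                          # move against the gradient
    b -= step_size * grad_b
```

After the loop, `w` and `b` should end up close to the values used to generate the data. Each step only sees 32 random samples, which is the "stochastic" part of the name.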

<img src="figures/cap05.png" width=600 />

<img src="figures/cap06.png" width=600 />

<img src="figures/cap07.png" width=600 />

<img src="http://imgtec.eetrend.com/sites/imgtec.eetrend.com/files/201706/blog/9908-27874-7.gif"  />

## 2.4.4 Chaining derivatives: the Backpropagation algorithm

#### References

> * [5] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / Lecture 4: Backpropagation and Neural Networks - http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture4.pdf


f(W1, W2, W3) = a(W1, b(W2, c(W3)))

chain rule: (f(g(x)))' = f'(g(x)) * g'(x)

Applying the chain rule to the computation of the gradient values of a neural network gives rise to an algorithm called Backpropagation (also sometimes called reverse-mode differentiation). 
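
The idea can be followed by hand on a very small chain of operations. The network below is an assumption made for illustration: one scalar weight per step, a relu in between, and a squared-error loss. The backward pass applies the chain rule one step at a time, starting from the loss and moving toward the first weight.

```python
def forward_backward(w1, w2, x, y_true):
    """Return the loss and its derivatives with respect to w1 and w2."""
    # Forward pass: x -> z -> h -> y_pred -> loss
    z = w1 * x
    h = max(z, 0.0)                    # relu
    y_pred = w2 * h
    loss = (y_pred - y_true) ** 2

    # Backward pass: multiply local derivatives, from the loss back to w1
    dloss_dy = 2.0 * (y_pred - y_true)
    dloss_dw2 = dloss_dy * h
    dloss_dh = dloss_dy * w2
    dloss_dz = dloss_dh * (1.0 if z > 0 else 0.0)    # derivative of relu
    dloss_dw1 = dloss_dz * x
    return loss, dloss_dw1, dloss_dw2

w1, w2, x, y_true = 0.5, -1.5, 2.0, 1.0
loss, g1, g2 = forward_backward(w1, w2, x, y_true)

# Sanity check of g1 with a central finite difference on w1.
eps = 1e-6
num_g1 = (forward_backward(w1 + eps, w2, x, y_true)[0]
          - forward_backward(w1 - eps, w2, x, y_true)[0]) / (2 * eps)
```

With these inputs the forward pass gives z = 1.0, h = 1.0, y_pred = -1.5 and loss = 6.25, and the backward pass gives g2 = -5.0 and g1 = 15.0. The finite-difference estimate `num_g1` should match `g1` up to floating-point error.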

<img src="https://cdn-images-1.medium.com/max/2000/1*q1M7LGiDTirwU-4LcFq7_Q.png" width=600 />

<img src="https://cdn-images-1.medium.com/max/2000/1*FceBJSJ7j8jHjb4TmLV0Ew.png" width=600 />

symbolic differentiation

<img src="http://player.slideplayer.com/47/11696149/data/images/img5.jpg" width=400 />

# 2.5 Looking back at our first example

In [6]:
from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255


Using TensorFlow backend.


In [7]:
from keras import models
from keras import layers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

In [8]:
network.compile(optimizer='rmsprop',
                loss='categorical_crossentropy',
                metrics=['accuracy'])

In [10]:
from keras.utils import to_categorical

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

In [11]:
network.fit(train_images, train_labels, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fd7934feb70>

# References
* [1] Deep Learning with Python - https://www.manning.com/books/deep-learning-with-python
* [2] Jupyter notebooks for the code samples of the book "Deep Learning with Python" - https://github.com/fchollet/deep-learning-with-python-notebooks
* [3] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / Lecture 2: Image Classification Pipeline - http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture2.pdf
* [4] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / Lecture 3: Loss Functions and Optimization - http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture3.pdf
* [5] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / Lecture 4: Backpropagation and Neural Networks - http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture4.pdf
* [6] Affine Mapping/Affine Transformation - http://www.sharetechnote.com/html/EngMath_Matrix_AffineMapping.html
* [7] Stanford's class (2017) / CS231n: Convolutional Neural Networks for Visual Recognition / optimization notes - http://cs231n.github.io/optimization-1/