# Optimizers in TensorFlow

> This notebook covers optimization methods based on gradient descent and their variants.

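## Gradient Descent
- Batch gradient descent
  - $\theta = \theta - \eta \cdot \nabla _ { \theta } J ( \theta )$
  - Computes the total error over all samples and updates the weights from that total.
  - Training is slow on large datasets, which may also contain redundant samples.

- Stochastic gradient descent (SGD)
  - $\theta = \theta - \eta \cdot \nabla _ { \theta } J \left( \theta ; x ^ { ( i ) } ; y ^ { ( i ) } \right)$
  - Uses a batch size of 1. It differs from batch gradient descent only in how data is fed: the gradient is computed and the parameters are updated from one sample at a time.
  - The updates are very noisy, so an individual iteration does not necessarily move in a direction that reduces the loss.

- Mini-batch gradient descent (mini-batch SGD)
  - $\theta = \theta - \eta \cdot \nabla _ { \theta } J \left( \theta ; x ^ { ( i : i + n ) } ; y ^ { ( i : i + n ) } \right)$
  - A compromise between batch and stochastic gradient descent; a mini-batch typically contains 10 to 1000 randomly chosen samples.
  - Reduces the noise of SGD updates while remaining more efficient than full-batch updates.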
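
The three variants differ only in how many samples feed each update. The following is a minimal NumPy sketch of that idea, assuming a toy one-parameter regression that is not part of the original notebook; the data and the helper names `grad` and `train` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data (an assumption): y = 3 * x plus a little noise.
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 0.1 * rng.normal(size=200)


def grad(w, xb, yb):
    """Gradient of mean((w * x - y) ** 2) with respect to w on one batch."""
    return np.mean(2.0 * (w * xb - yb) * xb)


def train(batch_size, lr=0.1, epochs=100):
    """Run gradient descent, using `batch_size` samples per update."""
    w = 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= lr * grad(w, x[idx], y[idx])
    return w


w_full = train(batch_size=len(x))  # batch gradient descent: one update per epoch
w_sgd = train(batch_size=1)        # SGD: one update per sample
w_mini = train(batch_size=32)      # mini-batch: a handful of updates per epoch
print(w_full, w_sgd, w_mini)
```

All three estimates should land close to 3. The batch size only changes how many updates an epoch contains and how noisy each one is.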


In [1]:
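# Requires the TensorFlow 1.x graph API (tf.placeholder, tf.Session).
# Under TensorFlow 2.x, the same calls are available through tf.compat.v1
# after calling tf.compat.v1.disable_eager_execution().
import tensorflow as tf

# Shared tensors: target y, input x (fed at run time), trainable weight w
y = tf.constant(3, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
w = tf.Variable(2, dtype=tf.float32)

# Shared operations
# Prediction
p = w * x
# Squared-error loss
losses = tf.square(p - y)
# Gradient of the loss with respect to w (tf.gradients returns a list)
grad = tf.gradients(losses, w)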

Testing plain gradient descent

In [3]:
def test_gradient():
    global losses, w, grad
    # Learning rate
    lr = tf.constant(0.3, dtype=tf.float32)

    init = tf.global_variables_initializer()
    # Update the value of w
    update = tf.assign(w, w - lr * grad[0])
    with tf.Session() as sess:
        print('--' * 20, 'Test gradient', '--' * 20)
        sess.run(init)
        print(sess.run([grad, p, w], {x: 1}))

        for i in range(20):
            w_, g_, l_ = sess.run([w, grad, losses], feed_dict={x: 1})
            print('Iteration: {}, w:{}, g: {}, loss: {}'.format(i, w_, g_, l_))

            _ = sess.run(update, feed_dict={x: 1})
test_gradient()

---------------------------------------- Test gradient ----------------------------------------
[[-2.0], 2.0, 2.0]
Iteration: 0, w:2.0, g: [-2.0], loss: 1.0
Iteration: 1, w:2.5999999046325684, g: [-0.8000002], loss: 0.16000007092952728
Iteration: 2, w:2.8399999141693115, g: [-0.32000017], loss: 0.025600027292966843
Iteration: 3, w:2.935999870300293, g: [-0.12800026], loss: 0.004096016753464937
Iteration: 4, w:2.974400043487549, g: [-0.051199913], loss: 0.0006553577841259539
Iteration: 5, w:2.989759922027588, g: [-0.020480156], loss: 0.00010485919483471662
Iteration: 6, w:2.995903968811035, g: [-0.008192062], loss: 1.677747059147805e-05
Iteration: 7, w:2.998361587524414, g: [-0.003276825], loss: 2.6843954401556402e-06
Iteration: 8, w:2.99934458732605, g: [-0.0013108253], loss: 4.2956577317454503e-07
Iteration: 9, w:2.9997377395629883, g: [-0.0005245209], loss: 6.87805368215777e-08
Iteration: 10, w:2.9998950958251953, g: [-0.00020980835], loss: 1.1004885891452432e-08
Iteration: 11, w:2.9
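
The trace above can be checked without TensorFlow. With `x = 1` the loss is $(w - 3)^2$ and its gradient is $2(w - 3)$, so each update multiplies the remaining error $3 - w$ by $1 - 2 \cdot 0.3 = 0.4$. A small plain-Python sketch of the same loop (the helper name `grad` is an illustrative stand-in for the notebook's `tf.gradients` call):

```python
def grad(w, x=1.0, y=3.0):
    """d/dw of (w * x - y) ** 2."""
    return 2.0 * (w * x - y) * x


w, lr = 2.0, 0.3
trace = [w]
for _ in range(5):
    w -= lr * grad(w)
    trace.append(w)

# Each step multiplies the remaining error (3 - w) by 1 - 2 * lr.
ratios = [(3.0 - b) / (3.0 - a) for a, b in zip(trace, trace[1:])]
print(trace)
print(ratios)
```

The first few values of `trace` agree with the recorded `w` column up to float32 rounding.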

## Momentum Optimizer
- Without momentum
![sgd_without_momentum](./resources/sgd_without_momentum.gif)
Without momentum, SGD oscillates strongly as it updates; this situation is common around local optima.


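- With momentum
  - $\begin{aligned} v _ { t } & = \gamma v _ { t - 1 } + \eta \nabla _ { \theta } J ( \theta ) \\ \theta & = \theta - v _ { t } \end{aligned}$
![sgd_with_momentum](./resources/sgd_with_momentum.gif)
  - Momentum helps SGD accelerate along the directions that consistently reduce the loss while damping oscillations.
  - The momentum term adds a fraction $\gamma$ of the previous update vector to the current one. $\gamma$ is usually set to 0.9 or a similar value, so that SGD descends faster in the right direction.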
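
To make the acceleration concrete, here is a hedged plain-Python sketch on a poorly conditioned quadratic, $f(x, y) = \frac{1}{2}(x^2 + 100 y^2)$. The test function, starting point, learning rate, and step count are assumptions chosen for illustration and are not part of the original notebook. The update follows the two equations above, and `gamma=0.0` reduces it to plain gradient descent.

```python
def grad(p):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = p
    return (x, 100.0 * y)


def loss(p):
    x, y = p
    return 0.5 * (x * x + 100.0 * y * y)


def run(gamma, lr=0.01, steps=200):
    """Apply v = gamma * v + lr * grad; p = p - v, and return the final loss."""
    p = (10.0, 1.0)
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(p)
        v = tuple(gamma * vi + lr * gi for vi, gi in zip(v, g))
        p = tuple(pi - vi for pi, vi in zip(p, v))
    return loss(p)


loss_plain = run(gamma=0.0)     # crawls along the shallow x direction
loss_momentum = run(gamma=0.9)  # builds up speed along x
print(loss_plain, loss_momentum)
```

With these settings the learning rate is capped by the steep `y` direction, so plain gradient descent shrinks `x` by only 1% per step, while the momentum run ends with a much smaller loss.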
  

Testing the Momentum optimizer

In [6]:
def test_momentum_api():
    global losses, w, grad

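    # Momentum coefficient
    mu = 0.9
    # NOTE: the recorded output below matches lr = 0.3 (w goes 2.0 -> 2.6 on
    # the first step). With lr = 0.03 the first step gives w = 2.06, as in
    # test_momentum_hand.
    lr = tf.constant(0.03, dtype=tf.float32)
    # init = tf.global_variables_initializer()

    # Build the op that applies one momentum update to w to reduce the loss
    update = tf.train.MomentumOptimizer(lr, mu).minimize(losses)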
    with tf.Session() as sess:
        print('--' * 20, 'Test momentum api', '--' * 20)
        # sess.run(init)
        sess.run(tf.global_variables_initializer())
        print(sess.run([grad, p, w], {x: 1}))
        for i in range(20):
            w_, g_, l_ = sess.run([w, grad, losses], feed_dict={x: 1})
            print('Iteration: {}, w:{}, g: {}, loss: {}'.format(i, w_, g_, l_))

            _ = sess.run([update], feed_dict={x: 1})

test_momentum_api()

---------------------------------------- Test momentum api ----------------------------------------
[[-2.0], 2.0, 2.0]
Iteration: 0, w:2.0, g: [-2.0], loss: 1.0
Iteration: 1, w:2.5999999046325684, g: [-0.8000002], loss: 0.16000007092952728
Iteration: 2, w:3.380000114440918, g: [0.7600002], loss: 0.14440008997917175
Iteration: 3, w:3.8540000915527344, g: [1.7080002], loss: 0.7293161749839783
Iteration: 4, w:3.768199920654297, g: [1.5363998], loss: 0.5901311039924622
Iteration: 5, w:3.230059862136841, g: [0.46011972], loss: 0.05292753875255585
Iteration: 6, w:2.6076979637145996, g: [-0.7846041], loss: 0.1539008915424347
Iteration: 7, w:2.2829535007476807, g: [-1.434093], loss: 0.5141556859016418
Iteration: 8, w:2.4209113121032715, g: [-1.1581774], loss: 0.33534371852874756
Iteration: 9, w:2.892526626586914, g: [-0.21494675], loss: 0.01155052613466978
Iteration: 10, w:3.3814644813537598, g: [0.76292896], loss: 0.14551514387130737
Iteration: 11, w:3.592629909515381, g: [1.1852598], loss: 0

In [7]:
def test_momentum_hand():
    global losses, w, grad
    mu = 0.9
    lr = tf.constant(0.03, dtype=tf.float32)
    v = tf.Variable(0, dtype=tf.float32)
    init = tf.global_variables_initializer()

    # v_t = mu * v_{t-1} + lr * grad, then w = w - v_t
    update1 = tf.assign(v, mu * v + grad[0] * lr)
    update2 = tf.assign(w, w - v)
    with tf.Session() as sess:
        print('--' * 20, 'Test momentum hand', '--' * 20)
        sess.run(init)
        print(sess.run([grad, p, w], {x: 1}))
        for i in range(20):
            w_, g_, l_, v_ = sess.run([w, grad, losses, v], feed_dict={x: 1})
            print('Iteration: {}, w:{}, g: {}, loss: {}, v: {}'.format(i, w_, g_, l_, v_))
            _ = sess.run([update1], feed_dict={x: 1})
            _ = sess.run([update2], feed_dict={x: 1})

test_momentum_hand()

---------------------------------------- Test momentum hand ----------------------------------------
[[-2.0], 2.0, 2.0]
Iteration: 0, w:2.0, g: [-2.0], loss: 1.0, v: 0.0
Iteration: 1, w:2.059999942779541, g: [-1.8800001], loss: 0.883600115776062, v: -0.05999999865889549
Iteration: 2, w:2.1703999042510986, g: [-1.6592002], loss: 0.6882362961769104, v: -0.1103999987244606
Iteration: 3, w:2.319535970687866, g: [-1.360928], loss: 0.4630312919616699, v: -0.1491360068321228
Iteration: 4, w:2.494586229324341, g: [-1.0108275], loss: 0.2554430663585663, v: -0.17505024373531342
Iteration: 5, w:2.6824562549591064, g: [-0.6350875], loss: 0.10083402693271637, v: -0.18787004053592682
Iteration: 6, w:2.870591878890991, g: [-0.25881624], loss: 0.016746461391448975, v: -0.18813565373420715
Iteration: 7, w:3.0476784706115723, g: [0.09535694], loss: 0.002273236634209752, v: -0.17708657681941986
Iteration: 8, w:3.204195737838745, g: [0.40839148], loss: 0.04169590026140213, v: -0.15651720762252808
Iteratio
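
The hand-written update scales the gradient by the learning rate before accumulating it, whereas the TensorFlow 1.x documentation describes `tf.train.MomentumOptimizer` as accumulating the raw gradient and scaling afterwards. For a constant learning rate the two forms give the same weights. The following plain-Python sketch checks this under that assumption; `doc_form` and `accum_form` are illustrative names, not TensorFlow functions.

```python
def grad(w):
    """Gradient of (w - 3) ** 2, the notebook's loss with x = 1."""
    return 2.0 * (w - 3.0)


def doc_form(lr, mu=0.9, steps=3):
    """v = mu * v + lr * g; w = w - v (the form used in this notebook)."""
    w, v, trace = 2.0, 0.0, [2.0]
    for _ in range(steps):
        v = mu * v + lr * grad(w)
        w = w - v
        trace.append(w)
    return trace


def accum_form(lr, mu=0.9, steps=3):
    """a = mu * a + g; w = w - lr * a (accumulate first, then scale)."""
    w, a, trace = 2.0, 0.0, [2.0]
    for _ in range(steps):
        a = mu * a + grad(w)
        w = w - lr * a
        trace.append(w)
    return trace


print(doc_form(0.03), accum_form(0.03))
print(doc_form(0.3), accum_form(0.3))
```

With `lr=0.03` both forms follow the hand-written trace (2.0, 2.06, 2.1704, ...), and with `lr=0.3` they follow 2.0, 2.6, 3.38, 3.854.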

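## Adagrad Optimizer
Adagrad adapts the learning rate to each parameter: it performs smaller updates (a low learning rate) for parameters tied to frequently occurring features and larger updates (a high learning rate) for parameters tied to infrequent features, which makes it well suited to sparse data.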

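### Update rule
$g _ { t , i } = \nabla _ { \theta } J \left( \theta _ { t , i } \right)$

$g_{t,i}$ is the partial derivative of the objective with respect to $\theta_i$ at time step $t$.

$\theta _ { t + 1 } = \theta _ { t } - \frac { \eta } { \sqrt { G _ { t } + \epsilon } } \odot g _ { t }$

$G _ { t } \in \mathbb { R } ^ { d \times d }$ is a diagonal matrix whose $i$-th diagonal entry is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$. $\epsilon$ is a smoothing term that avoids division by zero, commonly set to `1e-8`.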
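
Because $G_t$ is diagonal, the rule can be applied element-wise. The following NumPy sketch of the update above assumes a made-up two-parameter case in which one parameter receives a gradient on every step and the other only on every tenth step; the helper `adagrad_step` and the gradient pattern are illustrative.

```python
import numpy as np


def adagrad_step(theta, G, g, lr=0.1, eps=1e-8):
    """One element-wise Adagrad update; G holds the diagonal of G_t."""
    G = G + g * g
    theta = theta - lr / np.sqrt(G + eps) * g
    return theta, G


theta = np.zeros(2)
G = np.zeros(2)
for t in range(100):
    # Parameter 0 is "frequent"; parameter 1 only fires every 10th step.
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, G = adagrad_step(theta, G, g)

# Per-parameter step size that the next gradient would be scaled by.
effective_lr = 0.1 / np.sqrt(G + 1e-8)
print(G, effective_lr)
```

The rarely updated parameter ends with the smaller accumulator and therefore the larger effective learning rate, which is the behaviour that makes Adagrad attractive for sparse features.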

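### Advantages and disadvantages
- **Advantage**: the learning rate does not need to be tuned by hand; a default of 0.01 is commonly used.
- **Disadvantage**: the squared gradients keep accumulating in the denominator, so the effective learning rate keeps shrinking and can eventually become too small to make progress. Adadelta is one method designed to address this.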
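
As a rough worked case (an assumption for illustration, not taken from the original): if the gradient of one parameter keeps a constant magnitude $|g|$, then after $t$ steps

$G _ { t } = t \, g ^ { 2 } , \qquad \left| \Delta \theta _ { t } \right| = \frac { \eta \, | g | } { \sqrt { t \, g ^ { 2 } + \epsilon } } \approx \frac { \eta } { \sqrt { t } }$

so the step size decays like $1 / \sqrt{t}$ no matter how large the gradient is. In the toy problem of this notebook the gradients shrink geometrically instead, so the accumulator levels off (at about 4.80 in the hand-written trace) and the effective learning rate settles near $0.6 / \sqrt{4.80} \approx 0.27$ rather than vanishing.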

In [9]:
def test_adagrad_api():
    global losses, w, grad
    lr = tf.constant(0.6, dtype=tf.float32)
    update = tf.train.AdagradOptimizer(lr).minimize(losses)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        print('--' * 20, 'Test adagrad api', '--' * 20)
        sess.run(init)
        print(sess.run([grad, p, w], {x: 1}))
        for i in range(20):
            w_, g_, l_ = sess.run([w, grad, losses], feed_dict={x: 1})
            print('Iteration: {}, w:{}, g: {}, loss: {}'.format(i, w_, g_, l_))
            _ = sess.run([update], feed_dict={x: 1})
            
test_adagrad_api()

---------------------------------------- Test adagrad api ----------------------------------------
[[-2.0], 2.0, 2.0]
Iteration: 0, w:2.0, g: [-2.0], loss: 1.0
Iteration: 1, w:2.592637777328491, g: [-0.81472445], loss: 0.16594398021697998
Iteration: 2, w:2.816606044769287, g: [-0.3667879], loss: 0.03363334387540817
Iteration: 3, w:2.916041851043701, g: [-0.1679163], loss: 0.007048970554023981
Iteration: 4, w:2.9614334106445312, g: [-0.07713318], loss: 0.0014873817563056946
Iteration: 5, w:2.982271671295166, g: [-0.035456657], loss: 0.00031429363298229873
Iteration: 6, w:2.991849422454834, g: [-0.016301155], loss: 6.64319159113802e-05
Iteration: 7, w:2.9962525367736816, g: [-0.0074949265], loss: 1.4043480405234732e-05
Iteration: 8, w:2.998276948928833, g: [-0.0034461021], loss: 2.9689049370063003e-06
Iteration: 9, w:2.9992077350616455, g: [-0.0015845299], loss: 6.276837325458473e-07
Iteration: 10, w:2.999635696411133, g: [-0.0007286072], loss: 1.3271710486151278e-07
Iteration: 11, w:2.9

In [13]:
def test_adagrad_hand():
    global losses, w, grad
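    # Running sum of squared gradients (G_t). It starts at 0 here, whereas
    # tf.train.AdagradOptimizer defaults to initial_accumulator_value=0.1,
    # which is why this trace differs slightly from test_adagrad_api.
    sd = tf.Variable(0, dtype=tf.float32)
    lr = tf.constant(0.6, dtype=tf.float32)
    # Small constant that keeps the denominator away from zero
    regular = 1e-8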

    update1 = tf.assign_add(sd, tf.square(grad[0]))
    g_final = lr * grad[0] / (tf.sqrt(sd) + regular)
    update2 = tf.assign(w, w - g_final)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        print('--' * 20, 'Test adagrad hand', '--' * 20)
        sess.run(init)
        print(sess.run([grad, p, w], {x: 1}))
        for i in range(20):
            _ = sess.run([update1], feed_dict={x: 1})
            w_, g_, l_, sd_ = sess.run([w, grad, losses, sd], feed_dict={x: 1})
            print('Iteration: {}, w:{}, g: {}, loss: {}, sd:{}'.format(i, w_, g_, l_, sd_))
            _ = sess.run([update2], feed_dict={x: 1})
            
test_adagrad_hand()

---------------------------------------- Test adagrad hand ----------------------------------------
[[-2.0], 2.0, 2.0]
Iteration: 0, w:2.0, g: [-2.0], loss: 1.0, sd:4.0
Iteration: 1, w:2.5999999046325684, g: [-0.8000002], loss: 0.16000007092952728, sd:4.640000343322754
Iteration: 2, w:2.8228342533111572, g: [-0.3543315], loss: 0.031387701630592346, sd:4.7655510902404785
Iteration: 3, w:2.920222043991089, g: [-0.15955591], loss: 0.006364522036164999, sd:4.791008949279785
Iteration: 4, w:2.963959217071533, g: [-0.072081566], loss: 0.0012989380629733205, sd:4.796204566955566
Iteration: 5, w:2.9837074279785156, g: [-0.032585144], loss: 0.0002654478885233402, sd:4.797266483306885
Iteration: 6, w:2.992633819580078, g: [-0.014732361], loss: 5.426061397884041e-05, sd:4.797483444213867
Iteration: 7, w:2.9966695308685303, g: [-0.0066609383], loss: 1.1092024578829296e-05, sd:4.79752779006958
Iteration: 8, w:2.9984941482543945, g: [-0.0030117035], loss: 2.2675894797430374e-06, sd:4.797536849975586
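
The two Adagrad traces differ from the first step (2.5926 versus 2.6). A plain-Python sketch suggests the starting value of the accumulator explains this: it assumes the API run used TensorFlow 1.x's default `initial_accumulator_value=0.1`, while the hand-written version starts `sd` at 0. The function `adagrad_trace` is an illustrative helper and omits the negligible `1e-8` term.

```python
import math


def adagrad_trace(initial_accumulator, lr=0.6, steps=3):
    """Scalar Adagrad on the loss (w - 3) ** 2, starting from w = 2."""
    w, acc, trace = 2.0, initial_accumulator, [2.0]
    for _ in range(steps):
        g = 2.0 * (w - 3.0)
        acc += g * g                   # accumulate the squared gradient
        w -= lr * g / math.sqrt(acc)   # scale the step by 1 / sqrt(acc)
        trace.append(w)
    return trace


hand = adagrad_trace(initial_accumulator=0.0)
api_like = adagrad_trace(initial_accumulator=0.1)
print(hand)
print(api_like)
```

Starting from 0 follows the hand-written `w` column, and starting from 0.1 follows the API column, up to float32 rounding.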

## References
1. [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/index.html)