# Testing Logits Scales

## Introduction

When I was learning how to use a CNN to tackle the __[Carvana Competition](https://www.kaggle.com/c/carvana-image-masking-challenge)__ on __[Kaggle](http://www.kaggle.com)__, I came across __[ENet](https://arxiv.org/pdf/1606.02147.pdf)__, which is used for semantic segmentation. I skimmed through the __[paper](https://arxiv.org/pdf/1606.02147.pdf)__ and looked at some __[sample code](https://github.com/kwotsin/TensorFlow-ENet/blob/master/enet.py)__ online, and then started writing my own code for ENet. If x is my input, then z=ENet(x) is the set of logits I used to calculate the sigmoid cross entropy. I was not sure whether I had made mistakes in the code, but with Xavier initialization of the weights, the outputs z were really large - about 1e40 or more. As ENet is composed of many "bottleneck" layers, I found that after each pass through one of these bottleneck layers, the scale of the output increased by a factor of 10. Thus, I tried to "manually" divide the output by 10, and the ENet finally kind of worked (far from perfectly, but at least the loss function was no longer NaN and it decreased during training).

Because of this, I would like to investigate how different scales of logits affect training and performance, using the famous MNIST data. (I could not find anything online about this topic, though I think it is a very common question.)

***
## Preliminary Analysis

Let's start with some theoretical analysis. We let
*  $x$ - the input (`shape = [num_data, 784] or [num_data, 28,28]`). 
*  $z = f(x;\theta)$ - the logits given parameters $\theta$ in a given network architecture $f$. (`z.shape = [num_data, 10]`)
*  $\hat{y} = softmax(z)$ - our prediction probability. (`shape = [num_data, 10]`)
*  $y$ - labels. (`shape = [num_data, 10]` after one hot)

Then the loss function is given by
$$L(x;\theta)=-\sum_i y_i \log \hat{y}_i$$
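
To make the definitions above concrete, here is a minimal pure-Python sketch of the softmax and the cross-entropy loss for a single example. The 3-class logits and label are made-up toy values (MNIST has 10 classes), and the helper names `softmax` and `cross_entropy` are my own, not part of the notebook code.

```python
import math

def softmax(z):
    """Numerically stable softmax of a list of logits."""
    m = max(z)                              # subtract the max to avoid overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, z):
    """L = -sum_i y_i * log(softmax(z)_i) for a one-hot label y."""
    y_hat = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)

z = [2.0, 1.0, 0.1]   # hypothetical logits for a 3-class toy problem
y = [1.0, 0.0, 0.0]   # one-hot label for class 0
y_hat = softmax(z)
loss = cross_entropy(y, z)
print(y_hat, loss)
```

For a one-hot label the sum collapses to a single term, so the loss is just the negative log of the probability assigned to the true class.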

To reduce the loss, i.e. to train the model, we look at the derivatives of $L$ with respect to $\theta$. Using $\sum_i y_i = 1$ for one-hot labels, we first have
$$\dfrac{\partial L}{\partial z} =\hat{y} - y$$

Thus, by the chain rule, we have
$$\dfrac{\partial L}{\partial \theta} =(\hat{y} - y) \dfrac{\partial }{\partial \theta}\,f(x;\theta)$$

(Note that all of these are in fact vectors and matrices.)

Now we scale the logits, i.e. we let
* $\hat{\tilde{y}} = softmax(\alpha z)$ for some $\alpha > 0$, and write $\tilde{L}$ for the loss computed from $\hat{\tilde{y}}$.

We then have
$$\dfrac{\partial \tilde{L}}{\partial z} =\alpha(\hat{\tilde{y}} - y)$$

This means
$$\dfrac{\partial \tilde{L}}{\partial \theta} =\alpha(\hat{\tilde{y}} - y) \dfrac{\partial }{\partial \theta}\,f(x;\theta)$$

This means that **if we scale our logits by a factor of $\alpha$, the gradient is also (apparently) scaled by a factor of $\alpha$**. Of course it is *not that simple*: scaling the logits also changes the probability vector $\hat{y}$, and therefore the loss itself. It is very difficult to analyze how the training would turn out, but we can *try* to ignore the change in the loss and treat the gradient as simply scaled by a factor of $\alpha$. Since the gradient drives the learning, **scaling the logits by a factor of $\alpha$ is similar to scaling our learning rate by a factor of $\alpha$**, ignoring the (huge) effects of the change in the loss function.
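
As a sanity check on the derivative of the scaled loss, here is a small finite-difference sketch in pure Python. With $L = -\sum_i y_i \log \hat{y}_i$ and a one-hot $y$, the derivative with respect to $z$ should come out as $\alpha(\hat{\tilde{y}} - y)$. The toy logits, the label and all function names are my own assumptions for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_loss(y, z, alpha):
    """Cross entropy computed from softmax(alpha * z)."""
    p = softmax([alpha * v for v in z])
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

def analytic_grad(y, z, alpha):
    """Closed form: alpha * (softmax(alpha * z) - y)."""
    p = softmax([alpha * v for v in z])
    return [alpha * (pi - yi) for pi, yi in zip(p, y)]

def numeric_grad(y, z, alpha, eps=1e-6):
    """Central finite differences of scaled_loss with respect to each z_k."""
    grad = []
    for k in range(len(z)):
        zp, zm = list(z), list(z)
        zp[k] += eps
        zm[k] -= eps
        grad.append((scaled_loss(y, zp, alpha) - scaled_loss(y, zm, alpha)) / (2 * eps))
    return grad

z = [2.0, 1.0, 0.1]   # hypothetical logits
y = [1.0, 0.0, 0.0]   # one-hot label
max_errors = {}
for alpha in (0.1, 1.0, 10.0):
    a = analytic_grad(y, z, alpha)
    n = numeric_grad(y, z, alpha)
    max_errors[alpha] = max(abs(u - v) for u, v in zip(a, n))
print(max_errors)
```

The explicit factor $\alpha$ is only part of the story: $\hat{\tilde{y}}$ itself depends on $\alpha$, which is exactly the "not that simple" part of the argument.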

## Demo
We can now run some experiments to see how scaling the logits affects training.

The setup is as follows:
-  We use 98% of our data as the training set and 2% of our data as the test set.
-  We investigate several architectures, including a fully connected net and three CNNs.
-  We fix the number of epochs to be 5 (or 10).
-  We train with a gradient-based optimizer. Note that the `runnet` code below uses `tf.train.AdamOptimizer` rather than plain gradient descent.
-  The minibatch size is 1024.
-  Our base learning rate is 0.01 when it is not adjusted by the scale.
-  We keep track of the standard deviation of the logits.
-  We fix the random state (seed).
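
The learning-rate adjustment mentioned in the list can be sketched as follows. It mirrors the `rate = base_lr / scale` branch of `runnet` further down; the helper name `adjusted_rate` and the idea of an "effective prefactor" (`rate * scale`, following the heuristic from the preliminary analysis) are my own illustration, not part of the notebook code.

```python
import math

def adjusted_rate(base_lr, scale, learning_rate_adjusted):
    """Learning rate fed to the optimizer for a given logits scale."""
    return base_lr / scale if learning_rate_adjusted else base_lr

base_lr = 0.01
scalelist = [0.1, 0.3, 1, 3, 10]

# Under the heuristic "gradient is scaled by alpha", the step size is
# proportional to rate * scale. With the adjustment this stays at base_lr.
prefactors_adjusted = [adjusted_rate(base_lr, s, True) * s for s in scalelist]
prefactors_raw = [adjusted_rate(base_lr, s, False) * s for s in scalelist]
print(prefactors_adjusted)
print(prefactors_raw)
```

This only cancels the explicit factor of the scale in the gradient; the change in the softmax output itself is not compensated.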

Note that this notebook was run in a __[Kaggle](http://www.kaggle.com)__ Kernel.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import tensorflow as tf
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

In [2]:
df = pd.read_csv('../input/train.csv')
df.head()

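| | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |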


In [3]:
Xtrain, Xtest, ytrain, ytest = train_test_split(df.iloc[:,1:],df.iloc[:,0],train_size=0.98, test_size=0.02, random_state=0)
Xtrain=np.array(Xtrain).reshape(-1,28,28,1)
Xtest=np.array(Xtest).reshape(-1,28,28,1)
enc = OneHotEncoder()
ytrain= enc.fit_transform(np.array(ytrain).reshape(-1,1)).toarray()
ytest= enc.transform(np.array(ytest).reshape(-1,1)).toarray()
print(Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape)

(41160, 28, 28, 1) (840, 28, 28, 1) (41160, 10) (840, 10)


In [4]:
# Several network architectures. Each function takes the input placeholder x and returns a logits tensor.
def cnn1(x):
    net = tf.layers.conv2d(x,8,[5, 5],strides=(1,1),padding="valid",activation=tf.tanh)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=[2, 2], strides=2)
    net = tf.layers.conv2d(net,32, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=[2, 2], strides=2)
    net = tf.layers.conv2d(net,1, [1,1], strides=(1,1), padding="same", activation=None)
    n_train = tf.shape(net)[0]
    net = tf.reshape(net, [n_train, 16])
    return tf.contrib.layers.fully_connected(net, 10, activation_fn=None)

def fc(x):
    n_train = tf.shape(x)[0]
    net = tf.reshape(x, [n_train, 784])
    net = tf.contrib.layers.fully_connected(net, 196, activation_fn=tf.nn.relu)
    net = tf.contrib.layers.fully_connected(net, 49, activation_fn=tf.nn.relu)
    return tf.contrib.layers.fully_connected(net, 10, activation_fn=None)

def cnn2(x):
    net = tf.layers.conv2d(x , 4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net,4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net,4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net,4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net,4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net,4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net,1, [1,1], strides=(1,1), padding="valid", activation=tf.tanh)
    n_train = tf.shape(net)[0]
    net = tf.reshape(net, [n_train, 16])
    return tf.contrib.layers.fully_connected(net, 10, activation_fn=None)

def cnn3(x):
    net = tf.layers.conv2d(x,16,[3, 3],strides=(1,1),padding="valid",activation=tf.tanh)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=[2, 2], strides=2)
    net = tf.layers.conv2d(net,64, [4,4], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=[2, 2], strides=2)
    net = tf.layers.conv2d(net,1, [1,1], strides=(1,1), padding="same", activation=None)
    n_train = tf.shape(net)[0]
    net = tf.reshape(net, [n_train, 25])
    return tf.contrib.layers.fully_connected(net, 10, activation_fn=None)
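
The hard-coded flatten sizes in the `tf.reshape` calls above (16, 16 and 25) can be double-checked with a bit of arithmetic. The sketch below assumes 28x28 inputs, `valid` padding for the convolutions as written, and `valid` padding for the pooling layers (which, as far as I know, is the default of `tf.layers.max_pooling2d`). The helper names are mine.

```python
def conv_valid(size, kernel, stride=1):
    """Spatial size after a 'valid' convolution."""
    return (size - kernel) // stride + 1

def pool_valid(size, pool_size=2, stride=2):
    """Spatial size after a 'valid' max pooling."""
    return (size - pool_size) // stride + 1

# cnn1: conv 5x5 -> pool -> conv 5x5 -> pool -> 1x1 conv with a single filter
s = pool_valid(conv_valid(pool_valid(conv_valid(28, 5)), 5))
cnn1_flat = s * s * 1

# cnn2: six 5x5 'valid' convolutions -> 1x1 conv with a single filter
s = 28
for _ in range(6):
    s = conv_valid(s, 5)
cnn2_flat = s * s * 1

# cnn3: conv 3x3 -> pool -> conv 4x4 -> pool -> 1x1 conv with a single filter
s = pool_valid(conv_valid(pool_valid(conv_valid(28, 3)), 4))
cnn3_flat = s * s * 1

print(cnn1_flat, cnn2_flat, cnn3_flat)
```

The final 1x1 convolution in each network has one filter and does not change the spatial size, so the flattened length is just the squared spatial size.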

In [None]:
def runnet(Xtrain, Xtest, ytrain, ytest, func,epochs = 5,scale=1, base_lr = 0.001, learning_rate_adjusted = False, seed=None):
    if func not in [cnn1,cnn2,cnn3,fc]:
        print('Input Function Incorrect!')
        return
    numlist=np.array([0,1,2,3,4,5,6,7,8,9])
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    x = tf.placeholder(tf.float32, shape=[None, 28,28,1])
    y = tf.placeholder(tf.float32, shape = [None, 10])
    lr = tf.placeholder(tf.float32, shape = [])
    out = func(x) * scale
    loss=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=out))
    train_step = tf.train.AdamOptimizer(lr).minimize(loss)
    bs = 1024
    acclist=[]
    stdlist=[]
    #trainlist=[]
    #testlist=[]
    if learning_rate_adjusted:
        rate = base_lr / scale 
    else:
        rate = base_lr
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        er=sess.run(loss, feed_dict={x:Xtrain[:1], y:ytrain[:1], lr: 0})
        print('Running: Scale={}, BaseLR={}, LR_Adjusted={}, Seed={}'.format(scale, base_lr, learning_rate_adjusted, seed))
        print('Initial Logits StDev: ',np.std(sess.run(out, feed_dict={x:Xtest})))
        for i in range(epochs):
            tic = time.time()
            for j in range(len(Xtrain)//bs):
                _, er=sess.run([train_step, loss], feed_dict={x:Xtrain[j*bs:(j+1)*bs], y:ytrain[j*bs:(j+1)*bs], lr: rate})
                ertest, pred = sess.run([loss,out], feed_dict={x:Xtest, y:ytest})
                stdev = np.std(pred)
                pred=np.argmax(pred, axis=1)
                acc=(pred==np.dot(ytest, numlist)).sum()/len(ytest)*100

                if j%3==0:
                    #print('Training Error: {:8.4f}   Test Error: {:8.4f}   Accuracy: {:5.2f}%'.format(er, ertest, acc))
                    acclist.append(acc)
                    stdlist.append(stdev)
                    #trainlist.append(er)
                    #testlist.append(ertest)
            toc = time.time()
            print('Epoch {}   Training Error: {:8.4f}   Test Error: {:8.4f}   Accuracy: {:5.2f}%   Time: {:6.2f}s'.format(i+1,er, ertest, acc, toc- tic))
            tic = toc
        #print(np.std(sess.run(out, feed_dict={x:Xtest})))
    return acclist, stdlist#, trainlist, testlist
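
One thing worth keeping in mind when reading the two quantities `runnet` returns: for fixed weights, multiplying the logits by a positive scale does not change the argmax (so the accuracy is the same), while the standard deviation of the logits is multiplied by the scale. A minimal pure-Python sketch with made-up 3-class logits; the helper names are mine, and `statistics.pstdev` plays the role of `np.std`.

```python
import statistics

def argmax(row):
    """Index of the largest entry of a list."""
    return max(range(len(row)), key=lambda k: row[k])

def accuracy(logits, onehot):
    """Percentage of rows where the predicted class matches the one-hot label."""
    hits = sum(argmax(z) == argmax(y) for z, y in zip(logits, onehot))
    return 100.0 * hits / len(onehot)

def logits_std(logits):
    """Population standard deviation over all logits."""
    return statistics.pstdev([v for row in logits for v in row])

logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.2, 0.2, 0.9]]  # toy values
onehot = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
alpha = 10.0
scaled = [[alpha * v for v in row] for row in logits]

base_acc = accuracy(logits, onehot)
scaled_acc = accuracy(scaled, onehot)
std_ratio = logits_std(scaled) / logits_std(logits)
print(base_acc, scaled_acc, std_ratio)
```

So any difference in accuracy between scales in the experiments comes from how the scale changes training, not from the prediction step itself.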


In [None]:
# fname encodes: network function, optimizer, base LR, LR adjusted (T/F), seeded (T/F), scale list (A-D), epochs
base_lr=0.01
scalelist = [0.1,0.3,1,3,10] #A
#scalelist = [0.01,0.1,1,10,100] #B
#scalelist = [1e-8,1e-6,1e-4,1e-2,1] #C
#scalelist = [1,100,1e4,1e6,1e8] #D
ep=10
adj=True
sd=0
ACC = []
STD = []
fname='C3AD1e-2TTA10'
for scale in scalelist:
    acclist, stdlist = runnet(Xtrain, Xtest, ytrain, ytest, cnn3, base_lr=base_lr,epochs = ep,scale=scale,learning_rate_adjusted = adj,seed=sd)
    ACC.append(acclist)
    STD.append(stdlist)
plt.title('CNN3 - Accuracy\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(ACC)):
    plt.plot(ACC[i], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-ACC.jpg', dpi=100)
plt.show()
plt.title('CNN3 - Logits StDev\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(STD)):
    plt.plot(STD[i][20:], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-STD.jpg', dpi=100)
plt.show()

In [None]:
base_lr=0.01
#scalelist = [0.1,0.3,1,3,10] #A
scalelist = [0.01,0.1,1,10,100] #B
#scalelist = [1e-8,1e-6,1e-4,1e-2,1] #C
#scalelist = [1,100,1e4,1e6,1e8] #D
ep=10
adj=True
sd=0
ACC = []
STD = []
fname='C3AD1e-2TTB10'
for scale in scalelist:
    acclist, stdlist = runnet(Xtrain, Xtest, ytrain, ytest, cnn3, base_lr=base_lr,epochs = ep,scale=scale,learning_rate_adjusted = adj,seed=sd)
    ACC.append(acclist)
    STD.append(stdlist)
plt.title('CNN3 - Accuracy\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(ACC)):
    plt.plot(ACC[i], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-ACC.jpg', dpi=100)
plt.show()
plt.title('CNN3 - Logits StDev\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(STD)):
    plt.plot(STD[i][20:], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-STD.jpg', dpi=100)
plt.show()

In [None]:
base_lr=0.001
scalelist = [0.1,0.3,1,3,10] #A
#scalelist = [0.01,0.1,1,10,100] #B
#scalelist = [1e-8,1e-6,1e-4,1e-2,1] #C
#scalelist = [1,100,1e4,1e6,1e8] #D
ep=10
adj=True
sd=0
ACC = []
STD = []
fname='C3AD1e-3TTA10'
for scale in scalelist:
    acclist, stdlist = runnet(Xtrain, Xtest, ytrain, ytest, cnn3, base_lr=base_lr,epochs = ep,scale=scale,learning_rate_adjusted = adj,seed=sd)
    ACC.append(acclist)
    STD.append(stdlist)
plt.title('CNN3 - Accuracy\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(ACC)):
    plt.plot(ACC[i], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-ACC.jpg', dpi=100)
plt.show()
plt.title('CNN3 - Logits StDev\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(STD)):
    plt.plot(STD[i][20:], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-STD.jpg', dpi=100)
plt.show()

In [None]:
base_lr=0.01
scalelist = [0.1,0.3,1,3,10] #A
#scalelist = [0.01,0.1,1,10,100] #B
#scalelist = [1e-8,1e-6,1e-4,1e-2,1] #C
#scalelist = [1,100,1e4,1e6,1e8] #D
ep=10
adj=False
sd=0
ACC = []
STD = []
fname='C3AD1e-2FTA10'
for scale in scalelist:
    acclist, stdlist = runnet(Xtrain, Xtest, ytrain, ytest, cnn3, base_lr=base_lr,epochs = ep,scale=scale,learning_rate_adjusted = adj,seed=sd)
    ACC.append(acclist)
    STD.append(stdlist)
plt.title('CNN3 - Accuracy\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(ACC)):
    plt.plot(ACC[i], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-ACC.jpg', dpi=100)
plt.show()
plt.title('CNN3 - Logits StDev\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(STD)):
    plt.plot(STD[i][20:], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-STD.jpg', dpi=100)
plt.show()

In [None]:
base_lr=0.01
scalelist = [0.1,0.3,1,3,10] #A
#scalelist = [0.01,0.1,1,10,100] #B
#scalelist = [1e-8,1e-6,1e-4,1e-2,1] #C
#scalelist = [1,100,1e4,1e6,1e8] #D
ep=10
adj=True
sd=None
ACC = []
STD = []
fname='C3AD1e-2TFA10'
for scale in scalelist:
    acclist, stdlist = runnet(Xtrain, Xtest, ytrain, ytest, cnn3, base_lr=base_lr,epochs = ep,scale=scale,learning_rate_adjusted = adj,seed=sd)
    ACC.append(acclist)
    STD.append(stdlist)
plt.title('CNN3 - Accuracy\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(ACC)):
    plt.plot(ACC[i], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-ACC.jpg', dpi=100)
plt.show()
plt.title('CNN3 - Logits StDev\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(base_lr,adj,ep,sd))
for i in range(len(STD)):
    plt.plot(STD[i][20:], label='Scale ='+str(scalelist[i]))
plt.legend()
plt.savefig(fname+'-STD.jpg', dpi=100)
plt.show()

In [None]:
def generatemarkdown():
    import os
    lis=pd.Series(os.listdir('../working'))
    # keep only the saved plots for the chosen prefix
    a=lis[lis.apply(lambda x: x.endswith('.jpg') and x.startswith('C1AD'))]
    a=a.sort_values()
    for name in a:
        print('<img src=\''+name+'\'>')
    
generatemarkdown()

In [None]:
from subprocess import check_output
print(check_output(["ls", "../working"]).decode("utf8"))

<img src='FCAD1e-3TFA50-ACC.jpg'>
<img src='FCAD1e-3TFA50-STD.jpg'>
<img src='FCAD1e-3FTA50-ACC.jpg'>
<img src='FCAD1e-3FTA50-STD.jpg'>

<img src='C1AD1e-1TTA10-ACC.jpg'>
<img src='C1AD1e-1TTA10-STD.jpg'>
<img src='C1AD1e-2TFA10-ACC.jpg'>
<img src='C1AD1e-2TFA10-STD.jpg'>
<img src='C1AD1e-2TTA10-ACC.jpg'>
<img src='C1AD1e-2TTA10-STD.jpg'>
<img src='C1AD1e-3FTA10-ACC.jpg'>
<img src='C1AD1e-3FTA10-STD.jpg'>
<img src='C1AD1e-3TTA10-ACC.jpg'>
<img src='C1AD1e-3TTA10-STD.jpg'>
<img src='C1AD1e-3TTB10-ACC.jpg'>
<img src='C1AD1e-3TTB10-STD.jpg'>
<img src='C1AD1e-4FTD10-ACC.jpg'>
<img src='C1AD1e-4FTD10-STD.jpg'>
<img src='C1AD1e-4TTA10-ACC.jpg'>
<img src='C1AD1e-4TTA10-STD.jpg'>