# CS189 : Introduction to Machine Learning
## Homework 6
### SID : 23274190  Name : Hye Soo Choi

#### Neural Networks for MNIST Digit Recognition
In this homework, we implement neural networks to classify handwritten digits using raw pixels as features. We use the MNIST digits dataset from previous homework assignments. The state-of-the-art error rate on this dataset using deep convolutional neural networks is around 0.5%. For this assignment, with appropriate parameter settings, we should reach an error rate of about 6% or lower using a neural network with one hidden layer.

In [1]:
import scipy.io as sio
import numpy as np
import numpy.random as nr
from sklearn import preprocessing

#### Preprocessing

We import the data and transform each label between 0 and 9 into a vector of length $10$ that has a single $1$ in the position of the true class and $0$ everywhere else.

In [150]:
train_data = sio.loadmat('./dataset/train.mat')
test_data = sio.loadmat('./dataset/test.mat')

In [149]:
tr_img = train_data['train_images']
tr_lb_num = train_data['train_labels'][:,0]
tr_img = np.reshape(tr_img, (784, 60000), order = 'F')

TypeError: 'function' object is not subscriptable

In [151]:
tr_lb_num = train_data['train_labels'][:,0]

In [4]:
ts_img = np.reshape(np.transpose(test_data['test_images']), (784, 10000), order = 'F')

In [5]:
all_img = np.append(tr_img, ts_img, axis = 1)

In [6]:
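# tr_lb is not defined yet at this point, so size the array from tr_lb_num
temp_lb = np.zeros((len(tr_lb_num),10))

for i in np.arange(len(tr_lb_num)):
    j = tr_lb_num[i]
    temp_lb[i,j] = 1   # one-hot: set the column of the true class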

tr_lb = temp_lb
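
The one-hot encoding can also be written without an explicit loop. The following is a minimal sketch, assuming the labels are integers in 0..9; the helper name `one_hot` is hypothetical and not part of the notebook.

```python
import numpy as np

def one_hot(labels, n_classes=10):
    # labels: 1-D array of integer class indices in [0, n_classes)
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], n_classes))
    # Fancy indexing sets exactly one entry per row: row i, column labels[i]
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

example = one_hot([3, 0, 9])   # three labels -> array of shape (3, 10)
```

Each row of `example` sums to 1, and `np.argmax(example, axis=1)` recovers the original labels.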

Then, we normalize, or standardize, all feature vectors.

In [7]:
all_img = all_img.astype(float) # change data type from int to float, in order to 
                                # facilitate calculations for standardization
all_img = preprocessing.scale(all_img, axis = 1)
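
For the (features, samples) layout used here, scaling along `axis = 1` gives every pixel row zero mean and unit variance across all images. The sketch below shows roughly the same computation in plain NumPy. The function name `standardize_rows` is hypothetical, and leaving constant rows at zero is an assumption about how zero-variance pixels should be handled.

```python
import numpy as np

def standardize_rows(x):
    # Subtract each row's mean and divide by its (population) standard deviation
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    std[std == 0] = 1.0   # constant rows (e.g. always-black border pixels) stay at 0
    return (x - mean) / std

toy = np.array([[1.0, 2.0, 3.0],
                [5.0, 5.0, 5.0]])
scaled = standardize_rows(toy)
```

In `scaled`, the first row has mean 0 and standard deviation 1, while the constant second row becomes all zeros.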

#### Separating into training data and validation data

We use 50,000 training points and 10,000 validation points for the reported train/validation accuracies.

In [8]:
ts_img = all_img[:, 60000:70000]
tr_img = all_img[:, 0:60000]
valid_ind = nr.choice(60000, 10000, replace = False)
vd_img = tr_img[: , valid_ind]
tr_img = tr_img[:, np.setdiff1d(np.arange(60000), valid_ind)]
vd_lb = tr_lb[valid_ind, :]
tr_lb = tr_lb[np.setdiff1d(np.arange(60000), valid_ind), :]

In [155]:
vd_lb_num = tr_lb_num[valid_ind]
tr_lb_num = tr_lb_num [np.setdiff1d(np.arange(60000), valid_ind)]
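
An equivalent way to obtain disjoint training and validation indices is to slice a single random permutation. This is only a sketch of the idea, not the notebook's method: the names `valid_idx` and `train_idx` are hypothetical, and unlike `np.setdiff1d`, the training indices are left in shuffled rather than sorted order.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n_valid = 60000, 10000

perm = rng.permutation(n_total)   # every index 0..59999 exactly once
valid_idx = perm[:n_valid]        # first 10,000 positions -> validation
train_idx = perm[n_valid:]        # remaining 50,000 -> training
```

Because both index sets come from the same permutation, they cannot overlap and together cover all 60,000 points.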

#### Neural Network
In this assignment, we are asked to implement a neural network with one hidden layer.
1.  We will be using a hidden layer of size 200. Let $n_{in} = 784$, the number of features for the digits class. Let $n_{hid} = 200$, the size of the hidden layer. Finally, let $n_{out}$ = 10, the number of classes. Then, we will have $n_{in} + 1$ units in the input layer, $n_{hid} + 1$ units in the hidden layer, and $n_{out}$ units in the output layer. The input and hidden layers have one additional unit which always takes a value of $1$ to represent bias. The output layer size is set to the number of classes. Each label will have to be transformed to a vector of length $10$ which has a single $1$ in the position of the true class and $0$ everywhere else.

2. The parameters of this model are the following:
-  $V$, an $n_{hid}$-by-($n_{in} + 1$) matrix where the $(i, j)$-entry represents the weight connecting the $j$-th unit in the input layer to the $i$-th unit in the hidden layer. The $i$-th row of $V$ represents the ensemble of weights feeding into the $i$-th hidden unit. Note: there is an additional column for weights connecting the bias term to each unit in the hidden layer.
- $W$, an $n_{out}$-by-($n_{hid} + 1$) matrix where the $(i, j)$-entry represents the weight connecting the $j$-th unit in the hidden layer to the $i$-th unit in the output layer. The $i$-th row of $W$ represents the ensemble of weights feeding into the $i$-th output unit. Note: again there is an additional column for weights connecting the bias term to each unit in the output layer.
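
To make the shapes concrete, here is a small sketch that pushes one random input through matrices of the stated sizes, using the tanh and sigmoid activations the notebook adopts further down. The random input and the generator `rng` are illustrative assumptions only.

```python
import numpy as np

n_in, n_hid, n_out = 784, 200, 10
rng = np.random.default_rng(0)
V = rng.normal(scale=0.01, size=(n_hid, n_in + 1))    # (200, 785)
W = rng.normal(scale=0.01, size=(n_out, n_hid + 1))   # (10, 201)

x = np.append(rng.normal(size=n_in), 1.0)       # (785,) input plus bias unit
hidden = np.append(np.tanh(V @ x), 1.0)         # (201,) hidden units plus bias unit
output = 1.0 / (1.0 + np.exp(-(W @ hidden)))    # (10,) one value per class

n_params = V.size + W.size   # 200 * 785 + 10 * 201 = 159010
```

The parameter count of 159,010 is also the length of the flattened gradient vector used later for gradient checking.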

#### Initialization of Weights

We initialize the weights with random values, which breaks the symmetry that occurs when all weights are initialized to 0. Two common choices are to draw values from a uniform distribution on $[-\epsilon, \epsilon]$ or from a Gaussian distribution with mean 0 and variance $\epsilon^2$, where $\epsilon$ is some small fixed constant.


In [9]:
nr.seed(0)
n_in = 784
n_hid = 200
n_out = 10
epsilon = 0.01
V0 = nr.normal(scale = epsilon, size = (n_hid, n_in + 1))
W0 = nr.normal(scale = epsilon, size = (n_out, n_hid + 1))
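
The cell above uses the Gaussian option. For comparison, a sketch of the uniform option on $[-\epsilon, \epsilon]$ could look like this; the names `V0_uniform` and `W0_uniform` are hypothetical and are not used elsewhere in the notebook.

```python
import numpy.random as nr

n_in, n_hid, n_out = 784, 200, 10
epsilon = 0.01

nr.seed(0)
# Every weight is drawn independently from the interval [-epsilon, epsilon)
V0_uniform = nr.uniform(low=-epsilon, high=epsilon, size=(n_hid, n_in + 1))
W0_uniform = nr.uniform(low=-epsilon, high=epsilon, size=(n_out, n_hid + 1))
```

Either choice keeps the initial pre-activations small, so tanh and the sigmoid start in their near-linear range.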

#### Adding one additional unit which always takes a value of 1 to represent bias

In [10]:
def add_bias(mat):
    if len(mat.shape) > 1:
        ncol = mat.shape[1]
        temp = np.array([[1.0 for j in range(ncol)]])
        mat = np.append(mat, temp, axis = 0)
    else:
        mat = np.append(mat, [1.0], axis = 0)
    return mat

In [11]:
tr_img = add_bias(tr_img) # append one row of 1's (the bias unit) to the training data

In [12]:
tr_img.shape

(785, 50000)

#### Two loss functions : Mean-Squared Error and Cross-Entropy Error

In [77]:
def mean_squared(true, pred):
    temp = np.square(true - pred)
    err = np.sum(temp)
    return err/2

def cross_entropy(true, pred):
    n,k = true.shape
    ind = (true == 1)
    temp = np.sum(true[ind] * np.log(pred[ind])) + np.sum((1-true[~ind]) * np.log(1-pred[~ind]))
    err = - temp
    return err

In [81]:
cross_entropy(tr_lb[1:2,],predict(V0,W0, tr_img[:,1:2]))

6.9353285217470635
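
The value of about 6.935 at initialization has a simple explanation: with small random weights, every sigmoid output is close to 0.5, so each of the 10 output units contributes roughly $\ln 2$ to the cross-entropy. The sketch below checks this baseline with exact 0.5 predictions; `binary_cross_entropy` is a simplified stand-in for `cross_entropy`, and the position of the true class is arbitrary.

```python
import numpy as np

def binary_cross_entropy(true, pred):
    # Sum of per-unit binary cross-entropies over all output units
    return -np.sum(true * np.log(pred) + (1 - true) * np.log(1 - pred))

true = np.zeros((1, 10))
true[0, 4] = 1.0                  # arbitrary true class
pred = np.full((1, 10), 0.5)      # every output unit predicts exactly 0.5

baseline = binary_cross_entropy(true, pred)   # 10 * ln(2), about 6.93
```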

We use the tanh activation function for the hidden layer units and the sigmoid function for the output layer units.

In [76]:
   
def sigmoid_stb(mat):
    # Numerically-stable sigmoid function
    
    ind = (mat >= 0)
    temp = np.zeros(mat.shape)
    temp[ind] = 1/(1+np.exp(-mat[ind]))
    z =np.exp(mat[~ind])
    temp[~ind] = z /(1+z)
    
    return temp


def sigmoid(mat):
    temp = 1/(1+ np.exp(- mat))
    return temp
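
The two versions return the same values. The difference is that the naive form evaluates `np.exp(-mat)`, which overflows for large negative inputs, whereas the stable form only ever exponentiates non-positive numbers. The sketch below demonstrates this with a re-implementation named `sigmoid_stable` (a hypothetical name, mirroring `sigmoid_stb`), turning overflow warnings into errors to show that none occur.

```python
import numpy as np

def sigmoid_stable(mat):
    mat = np.asarray(mat, dtype=float)
    out = np.empty_like(mat)
    pos = mat >= 0
    # For non-negative inputs, exp(-x) <= 1, so no overflow is possible
    out[pos] = 1.0 / (1.0 + np.exp(-mat[pos]))
    # For negative inputs, exp(x) < 1, so again no overflow is possible
    z = np.exp(mat[~pos])
    out[~pos] = z / (1.0 + z)
    return out

with np.errstate(over='raise'):   # any overflow would raise FloatingPointError
    values = sigmoid_stable(np.array([-1000.0, 0.0, 1000.0]))
```

Here `values` is `[0.0, 0.5, 1.0]`. The naive `1/(1 + np.exp(-x))` would hit an overflow at `x = -1000` under the same settings.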


#### Calculating the loss for coefficient matrices W and V 

Here we implement a function that takes the two weight matrices $V$ and $W$ and returns the values of the output units, using $V$ as the weights feeding the hidden layer and $W$ as the weights feeding the output layer. We also define helpers that evaluate a loss function and the misclassification rate for given weights.

In [168]:
def predict( V,W, img):
    temp = np.dot(V, img)
    hidden = np.tanh(temp)
    hidden = add_bias(hidden)
    temp = np.dot(W, hidden)
    return np.transpose(sigmoid_stb(temp))

def calculate_loss(V,W, img, true_label, loss_fun):
    pred_label = predict(V, W, img)
    loss = loss_fun(true_label, pred_label)
    return loss

def misclassification(V,W,img, true_num):
    pred = predict(V,W,img)
    pred_num = np.argmax(pred, axis = 1)
    return np.sum(pred_num != true_num)/len(true_num)
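
As a small worked example of the misclassification rate, the sketch below applies the same argmax rule to four made-up score rows; the numbers are purely illustrative.

```python
import numpy as np

# Rows are samples, columns are class scores (as returned by predict)
scores = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])
true_num = np.array([1, 0, 1, 0])

pred_num = np.argmax(scores, axis=1)          # predicted classes: [1, 0, 2, 0]
error_rate = np.mean(pred_num != true_num)    # one of four is wrong: 0.25
```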
    

#### Back propagation in Stochastic Gradient Descent
The way the matrix $V$ influences the final output, and thus the mean squared error, can be broken into the following steps:
$$
V \mapsto A = Vx \mapsto B = \tanh(A) \mapsto C = WB^* \mapsto D = \mathrm{sigmoid}(C) \mapsto E =\frac{1}{2}\|Y-D\|_2^2.
$$
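Therefore, stepwise,

\begin{align*}
\frac{\partial E}{\partial D} &= (D - Y),\\
\frac{\partial D}{\partial C} &= \mathrm{diag}(D)(I-\mathrm{diag}(D)),\\
\frac{\partial C}{\partial B} &= W^\top,\\
\frac{\partial B}{\partial A} &= I - \mathrm{diag}(B^2),\\
\frac{\partial A}{\partial V_j} &= x_j I,\\
\frac{\partial C}{\partial W_j} &= B^*_j I
\end{align*}
where $x$ is an $(n_{in} + 1)$-dimensional column vector with $j$-th entry $x_j$, $V_j$ and $W_j$ denote the $j$-th columns of the matrices $V$ and $W$, respectively, and $B^*$ is the result of adding a row of $1$ to $B$ (with $j$-th entry $B^*_j$). In $\frac{\partial C}{\partial B}$ only the first $n_{hid}$ columns of $W$ are used, since the bias unit does not depend on $B$. If we use the cross-entropy error $E = -\sum_k \left[ Y_k \log D_k + (1-Y_k)\log(1-D_k) \right]$ instead, only the first factor changes (division is elementwise):
$$
\frac{\partial E}{\partial D} = -\frac{Y}{D} + \frac{1-Y}{1-D}
$$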
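
For the cross-entropy $E = -\sum_k [Y_k \log D_k + (1-Y_k)\log(1-D_k)]$, the derivative with respect to $D$ is $-Y/D + (1-Y)/(1-D)$, which matches the signs used in `find_gradient`. Multiplying by the sigmoid derivative $D(1-D)$ collapses the product to $D - Y$. The sketch below checks this identity numerically on random pre-activations; the values of `C` and `Y` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(10, 1))        # arbitrary output pre-activations
D = 1.0 / (1.0 + np.exp(-C))        # sigmoid outputs
Y = np.zeros((10, 1))
Y[3, 0] = 1.0                       # arbitrary one-hot target

dEdD = -Y / D + (1 - Y) / (1 - D)   # cross-entropy derivative w.r.t. D
dDdC = D * (1 - D)                  # elementwise sigmoid derivative
dEdC = dDdC * dEdD                  # chain rule

simplified = D - Y                  # closed form of the same product
```

Using `D - Y` directly would avoid dividing by values of `D` that are close to 0 or 1.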

In [133]:
def find_gradient(V,W, i, loss_fun_name):
    
    x = tr_img[:,i:i+1]
    y = np.transpose(tr_lb[i:i+1,:])
    A = np.dot(V, x)
    B = np.tanh(A)
    B_bias = add_bias(B)
    C = np.dot(W, B_bias)
    D = sigmoid_stb(C)

    I = np.identity(n_out)
    if loss_fun_name == 'mean_squared':
        dEdD = - y + D
    elif loss_fun_name == 'cross_entropy':
        ind = (y == 1)
        temp = np.zeros(y.shape)
        temp[ind] = - y[ind]/D[ind]
        temp[~ind] = (1-y[~ind])/(1-D[~ind])
        dEdD = temp
        

    dDdC = np.multiply(D, 1-D)
    dCdB = np.transpose(W[:, :n_hid])
    dBdA = 1- np.square(B)
    
    dEdC = np.multiply(dDdC, dEdD)
    dEdB = np.dot(dCdB, dEdC)
    dEdA = np.multiply(dBdA, dEdB)
    dEdV = np.array([])
    dEdW = np.array([])
 
    
    dAdV = x[:,0]
    dEdV = np.outer(dEdA[:,0], dAdV)

    dCdW = B_bias[:,0]
    dEdW = np.outer(dEdC[:,0], dCdW)

        
    return np.concatenate((dEdV.ravel('F'), dEdW.ravel('F')))


In [134]:
a = find_gradient(V0, W0, 2, 'mean_squared')

In [136]:
b = find_gradient(V0, W0, 2, 'cross_entropy')

In [86]:
def column_to_matrix(vec, r, c):
    return vec.reshape((r,c), order = 'F')
    
def matrix_to_column(mat):
    return mat.ravel(order = 'F')
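
Both helpers use column-major (Fortran) order, the same order `find_gradient` uses when it flattens `dEdV` and `dEdW`. A quick round trip on a toy matrix shows the layout; the matrix `M` is an illustrative assumption.

```python
import numpy as np

def column_to_matrix(vec, r, c):
    return vec.reshape((r, c), order='F')

def matrix_to_column(mat):
    return mat.ravel(order='F')

M = np.arange(6).reshape(2, 3)            # [[0, 1, 2], [3, 4, 5]]
flat = matrix_to_column(M)                # columns stacked: [0, 3, 1, 4, 2, 5]
restored = column_to_matrix(flat, 2, 3)   # back to the original 2-by-3 matrix
```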
    

#### Numerical Gradient Checking

In [98]:
def numerical_gradient(V, W, i, eps = 1e-8):
    # Central-difference estimate of the mean-squared-loss gradient at sample i
    grad = np.concatenate((matrix_to_column(V), matrix_to_column(W)))

    num_grad = np.zeros(grad.shape)
    temp = grad.copy()   # work on a copy so perturbations never leak into grad
    for j in range(len(grad)):
        
        temp[j] = grad[j] + eps
        V_temp = column_to_matrix(temp[0:n_hid*(n_in + 1)], n_hid, n_in + 1)
        W_temp = column_to_matrix(temp[n_hid*(n_in + 1):], n_out, n_hid + 1)
        loss1 = calculate_loss(V_temp,W_temp, tr_img[:, i:i+1], tr_lb[i:i+1,:], mean_squared)
        

        temp[j] = grad[j] - eps
        V_temp = column_to_matrix(temp[0:n_hid*(n_in + 1)], n_hid, n_in + 1)
        W_temp = column_to_matrix(temp[n_hid*(n_in + 1):], n_out, n_hid + 1)
        loss2 = calculate_loss(V_temp,W_temp, tr_img[:, i:i+1],tr_lb[i:i+1,:], mean_squared)
        
        num_grad[j] = (loss1-loss2)/(2*eps)
        temp[j] = grad[j]   # restore coordinate j before moving on
        
    return num_grad
        
    

In [None]:
num_grad = numerical_gradient(V0, W0, 2)

In [28]:
def norm(l):
    return np.sqrt(np.sum(np.square(l)))

In [135]:
norm(a-num_grad)/norm(a + num_grad)

1.7624782661673705e-06

In [99]:
def numerical_gradient_cross_entropy(V, W, i, eps = 1e-8):
    # Central-difference estimate of the cross-entropy-loss gradient at sample i
    grad = np.concatenate((matrix_to_column(V), matrix_to_column(W)))

    num_grad = np.zeros(grad.shape)
    temp = grad.copy()   # work on a copy so perturbations never leak into grad
    for j in range(len(grad)):
        
        temp[j] = grad[j] + eps
        V_temp = column_to_matrix(temp[0:n_hid*(n_in + 1)], n_hid, n_in + 1)
        W_temp = column_to_matrix(temp[n_hid*(n_in + 1):], n_out, n_hid + 1)
        loss1 = calculate_loss(V_temp,W_temp, tr_img[:, i:i+1], tr_lb[i:i+1,:], cross_entropy)
        

        temp[j] = grad[j] - eps
        V_temp = column_to_matrix(temp[0:n_hid*(n_in + 1)], n_hid, n_in + 1)
        W_temp = column_to_matrix(temp[n_hid*(n_in + 1):], n_out, n_hid + 1)
        loss2 = calculate_loss(V_temp,W_temp, tr_img[:, i:i+1],tr_lb[i:i+1,:], cross_entropy)
        
        num_grad[j] = (loss1-loss2)/(2*eps)
        temp[j] = grad[j]   # restore coordinate j before moving on
        
    return num_grad
        

In [100]:
num_grad_cross = numerical_gradient_cross_entropy(V0, W0, 2)

In [137]:
norm(b-num_grad_cross)/norm(b+num_grad_cross)

1.798625244755789e-06

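For both losses, the relative error between the analytic and numerical gradients is on the order of $10^{-6}$, which strongly suggests that `find_gradient` computes the gradients correctly.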
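
The same check can be illustrated on a toy loss whose gradient is known in closed form. This is a minimal sketch, assuming the loss $f(w) = \frac{1}{2}\|w\|^2$ (so the gradient is $w$) and a step of `1e-5`; the helper names `central_difference` and `relative_error` are hypothetical.

```python
import numpy as np

def relative_error(a, b):
    # Same metric as above: ||a - b|| / ||a + b||
    return np.linalg.norm(a - b) / np.linalg.norm(a + b)

def central_difference(f, w, eps=1e-5):
    num_grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps                 # perturb only coordinate j
        num_grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return num_grad

def toy_loss(w):
    return 0.5 * np.sum(w ** 2)       # analytic gradient is w itself

w = np.array([0.5, -1.0, 2.0])
err = relative_error(w, central_difference(toy_loss, w))
```

Building a fresh `step` vector for every coordinate means no perturbation can carry over from one coordinate to the next.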

#### Train Neural Network by Stochastic gradient descent using Mean Squared loss

In [130]:
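def train(V, W, img, true, step):
    # One pass of stochastic gradient descent with the mean-squared loss.
    # Note: find_gradient reads the samples from the global tr_img / tr_lb.
    n = len(true)
    ind = nr.choice(n, n, replace = False)   # visit the samples in random order
    V_temp = V
    W_temp = W

    VW_temp = np.concatenate((matrix_to_column(V), matrix_to_column(W)))

    
    for i in ind:
        grad = find_gradient(V_temp,W_temp, i, 'mean_squared')
        VW_temp = VW_temp - step * grad
        # unpack the updated vector so the next gradient uses the new weights
        V_temp = column_to_matrix(VW_temp[0:n_hid*(n_in + 1)], n_hid, n_in + 1)
        W_temp = column_to_matrix(VW_temp[n_hid*(n_in + 1):], n_out, n_hid + 1)

    return (V_temp, W_temp)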
            

#### Train Neural Network by Stochastic gradient descent using Cross-entropy loss


In [147]:
def train_cross_entropy(V, W, img, true, index, start, end, step):
    n = len(true)

    V_temp = V
    W_temp = W

    
    VW_temp = np.concatenate((matrix_to_column(V), matrix_to_column(W)))
    
    for i in index[start * 1000 : end * 1000]:
        grad = find_gradient(V_temp,W_temp, i, 'cross_entropy')
        VW_temp = VW_temp - step * grad
        V_temp = column_to_matrix(VW_temp[0:n_hid*(n_in + 1)], n_hid, n_in + 1)
        W_temp = column_to_matrix(VW_temp[n_hid*(n_in + 1):], n_out, n_hid + 1)

    return (V_temp, W_temp)        

In [109]:
V_temp = V0
W_temp = W0
ind = nr.choice(50000,50000, replace = False)
loss = []
for j in range(500):
    V_temp, W_temp = train_cross_entropy(V_temp, W_temp, tr_img, tr_lb, ind, j, j+1, 0.01)
    loss_temp = calculate_loss(V_temp, W_temp, tr_img, tr_lb, mean_squared)
    print(j, ':', loss_temp)
    loss = np.append(loss, [loss_temp])

0 : 40929.7207341
1 : 30919.863594
2 : 25210.3998269
3 : 19617.1455284
4 : 19978.4905853
5 : 15535.5047449
6 : 14941.930507
7 : 14549.8250461
8 : 13352.3201274
9 : 13019.6248243
10 : 11819.2054932
11 : 10737.4617504
12 : 10472.0429109
13 : 10584.1127487
14 : 9445.36606802
15 : 9185.59786763
16 : 9587.57847501
17 : 9033.25093584
18 : 8746.21794187
19 : 8430.37232232
20 : 8319.67534463
21 : 7985.80457463
22 : 7879.50913551
23 : 7945.11128709
24 : 7909.94333898
25 : 7613.05979041
26 : 7543.93025939
27 : 7225.68056364
28 : 7039.57881143
29 : 7312.09177147
30 : 7286.55141191
31 : 7223.99054767
32 : 6907.16692561
33 : 6770.77727944
34 : 6967.78088041
35 : 6994.37608829
36 : 6753.22096898
37 : 6309.90442219
38 : 6517.28229107
39 : 6405.37610717
40 : 6395.6374105
41 : 6221.81380074
42 : 6305.94040168
43 : 6156.15985791
44 : 5966.26074138
45 : 5833.42905317
46 : 5864.27592939
47 : 6108.95630715
48 : 6268.94687916
49 : 6225.00257663
50 : 6022.1553757
51 : 5971.17301981
52 : 5820.6534213
53 : 603

In [110]:
V1, W1 = (V_temp, W_temp)

In [138]:
V_temp = V1
W_temp = W1
ind = nr.choice(50000,50000, replace = False)
loss = []
for j in range(500):
    V_temp, W_temp = train_cross_entropy(V_temp, W_temp, tr_img, tr_lb, ind, j, j+1, 0.01)
    loss_temp = calculate_loss(V_temp, W_temp, tr_img, tr_lb, mean_squared)
    print(j, ':', loss_temp)
    loss = np.append(loss, [loss_temp])

0 : 2131.68361506
1 : 2239.21629527
2 : 2141.299011
3 : 2177.4032656
4 : 2152.71774686
5 : 2100.13056585
6 : 2063.03047832
7 : 2253.72983597
8 : 2188.0815015
9 : 2048.68205625
10 : 2059.82275364
11 : 2103.55324848
12 : 2012.03336882
13 : 2031.89138515
14 : 1992.99635758
15 : 2003.66943677
16 : 2005.01650948
17 : 2009.89931144
18 : 1964.8693643
19 : 1976.66856727
20 : 2074.75036626
21 : 1915.95177943
22 : 1971.68624854
23 : 1970.4870967
24 : 2038.05327975
25 : 2007.79276962
26 : 2199.57572599
27 : 2074.23039717
28 : 2049.20359072
29 : 2016.2180123
30 : 2072.76581039
31 : 1932.08817488
32 : 1979.11928953
33 : 1917.05262618
34 : 1908.67575909
35 : 1950.33675721
36 : 1957.7083667
37 : 2006.44775612
38 : 1946.06352159
39 : 1941.3885433
40 : 1923.21175838
41 : 1920.18745028
42 : 1940.25837992
43 : 1951.5320429
44 : 1913.00647776
45 : 1880.84979774
46 : 1876.53148798
47 : 1907.69355618
48 : 1900.46772235
49 : 1889.07203212
50 : 1852.29777431
51 : 1932.60097347
52 : 1890.64119051
53 : 1907.763

In [139]:
V2, W2, loss2 = (V_temp, W_temp, loss)

In [145]:
V_temp = V2
W_temp = W2
ind = nr.choice(50000,50000, replace = False)
loss = []
misclass = []
for j in range(50):
    V_temp, W_temp = train_cross_entropy(V_temp, W_temp, tr_img, tr_lb, ind, j, j+1, 0.01 * 0.5)
    loss_temp = calculate_loss(V_temp, W_temp, tr_img, tr_lb, cross_entropy)
    mis_temp = misclassification(V_temp, W_temp, tr_img, tr_lb_num)
    loss = np.append(loss, [loss_temp])
    misclass= np.append(misclass, [mis_temp])

0 : 1518.88366495
1 : 1472.68700892
2 : 1437.62181461
3 : 1396.72880676
4 : 1418.97733638
5 : 1403.41458587
6 : 1405.3019995
7 : 1379.97832374
8 : 1369.87057909
9 : 1364.61401999
10 : 1352.1224759
11 : 1337.61787427
12 : 1323.62192572
13 : 1309.8646066
14 : 1298.29256177
15 : 1285.48817468
16 : 1267.91751903
17 : 1256.70808696
18 : 1260.08949812
19 : 1258.83277902
20 : 1268.96367254
21 : 1260.127289
22 : 1253.42083281
23 : 1228.40892467
24 : 1233.0390875
25 : 1207.16206421
26 : 1219.32495575
27 : 1219.69926901
28 : 1222.72249881
29 : 1198.74396724
30 : 1190.84161247
31 : 1182.38808732
32 : 1191.57893976
33 : 1171.96580938
34 : 1171.88984645
35 : 1164.22692815
36 : 1169.23692589
37 : 1164.98306185
38 : 1167.87644755
39 : 1160.28113862
40 : 1168.27694946
41 : 1171.62881617
42 : 1177.5952027
43 : 1152.30978045
44 : 1153.20228718
45 : 1150.16257075
46 : 1145.03274988
47 : 1146.14037235
48 : 1145.99205295
49 : 1157.94312826
50 : 1158.65137601
51 : 1163.33586918
52 : 1162.04148241
53 : 1143.

In [146]:
V3,W3,loss3 = (V_temp, W_temp, loss)

In [162]:
V_temp = V3
W_temp = W3
ind = nr.choice(50000,50000, replace = False)
loss = []
misclass = []
for j in range(50):
    V_temp, W_temp = train_cross_entropy(V_temp, W_temp, tr_img, tr_lb, ind, j, j+1, 0.01 * 0.5)
    loss_temp = calculate_loss(V_temp, W_temp, tr_img, tr_lb, cross_entropy)
    mis_temp = misclassification(V_temp, W_temp, tr_img, tr_lb_num)
    print(j, ':', loss_temp)
    loss = np.append(loss, [loss_temp])
    misclass= np.append(misclass, [mis_temp])

0 : 5021.19535318
1 : 4980.72587399
2 : 4970.43386972
3 : 4857.45571119
4 : 4896.27282937
5 : 4758.05864799
6 : 4839.07473864
7 : 4770.54219401
8 : 4720.96510383
9 : 4646.98320973
10 : 4619.18157203
11 : 4511.70544555
12 : 4494.51193733
13 : 4523.11011052
14 : 4476.570145
15 : 4487.95756542
16 : 4380.70085515
17 : 4318.76090504
18 : 4307.79725722
19 : 4366.01351285
20 : 4407.52082625
21 : 4272.09620929
22 : 4302.37946203
23 : 4228.52307568
24 : 4164.58804082
25 : 4196.0933082
26 : 4108.82706138
27 : 4181.22344488
28 : 4182.60917499
29 : 4082.41160142
30 : 4046.94064275
31 : 3909.12759841
32 : 3959.03055928
33 : 3907.34792905
34 : 3878.75225026
35 : 3946.57584942
36 : 3864.55177489
37 : 3883.09013801
38 : 3701.39023797
39 : 3758.25986081
40 : 3672.58077769
41 : 3754.00353455
42 : 3650.54539314
43 : 3555.73643215
44 : 3783.81261304
45 : 3682.44356532
46 : 3632.00961739
47 : 3588.47317888
48 : 3444.64211925
49 : 3391.59361413




In [170]:
V4,W4,loss4 = (V_temp, W_temp, loss)

In [169]:
np.argmax(predict(V_temp, W_temp, tr_img), axis=1).shape

(50000,)

In [171]:
V_temp = V4
W_temp = W4
ind = nr.choice(50000,50000, replace = False)
loss = loss4
misclass = []
for j in range(50):
    V_temp, W_temp = train_cross_entropy(V_temp, W_temp, tr_img, tr_lb, ind, j, j+1, 0.01 * 0.5*0.5)
    loss_temp = calculate_loss(V_temp, W_temp, tr_img, tr_lb, cross_entropy)
    mis_temp = misclassification(V_temp, W_temp, tr_img, tr_lb_num)
    print(j, ':', loss_temp, mis_temp)
    loss = np.append(loss, [loss_temp])
    misclass= np.append(misclass, [mis_temp])

0 : 3276.54200273 0.00416
1 : 3227.76496375 0.0041
2 : 3178.67085123 0.00404
3 : 3170.57205328 0.00398
4 : 3118.23086945 0.0037
5 : 3077.22647014 0.00378
6 : 3036.80179254 0.00366
7 : 3027.12613261 0.00356
8 : 2989.76033637 0.00338
9 : 2977.84970447 0.00352
10 : 2940.90096923 0.00346
11 : 2945.26391913 0.00352
12 : 2909.25873887 0.00336
13 : 2880.5142112 0.00338
14 : 2842.7032195 0.00318
15 : 2831.89945389 0.00322
16 : 2781.31628191 0.00316
17 : 2766.95286153 0.00316
18 : 2750.95468067 0.00304
19 : 2744.42380028 0.00298
20 : 2743.00301528 0.0031
21 : 2726.80107928 0.00302
22 : 2685.992867 0.00298
23 : 2688.20781961 0.00302
24 : 2660.11207628 0.00302
25 : 2662.87670895 0.00306
26 : 2632.49497007 0.003
27 : 2661.41898186 0.00298
28 : 2624.70503203 0.00288
29 : 2645.93257955 0.00284
30 : 2606.00665401 0.00278
31 : 2589.87675422 0.00288
32 : 2540.8363283 0.00278
33 : 2538.50845851 0.00274
34 : 2543.13273226 0.00274
35 : 2522.51816174 0.00256
36 : 2478.04218555 0.00246
37 : 2455.75245958 0.

In [172]:
V5,W5,loss5 = (V_temp, W_temp, loss)

In [173]:
V_temp = V5
W_temp = W5
ind = nr.choice(50000,50000, replace = False)
loss = loss5
misclass = []
for j in range(50):
    V_temp, W_temp = train_cross_entropy(V_temp, W_temp, tr_img, tr_lb, ind, j, j+1, 0.01 * 0.5*0.5)
    loss_temp = calculate_loss(V_temp, W_temp, tr_img, tr_lb, cross_entropy)
    mis_temp = misclassification(V_temp, W_temp, tr_img, tr_lb_num)
    print(j, ':', loss_temp, mis_temp)
    loss = np.append(loss, [loss_temp])
    misclass= np.append(misclass, [mis_temp])

0 : 2275.27653746 0.00206
1 : 2269.01665465 0.00202
2 : 2252.78138574 0.0021
3 : 2256.78304141 0.00204
4 : 2252.4089881 0.00202
5 : 2236.27507948 0.00198
6 : 2234.35864108 0.00204
7 : 2221.62930482 0.0021
8 : 2214.89033171 0.00208
9 : 2203.72950976 0.00204
10 : 2187.5632859 0.002
11 : 2199.47764528 0.00204
12 : 2186.23753563 0.00202
13 : 2165.30236341 0.00202
14 : 2159.28772825 0.002
15 : 2160.60911268 0.00206
16 : 2150.73977013 0.002
17 : 2145.58062905 0.00202
18 : 2137.59131412 0.00198
19 : 2133.06834618 0.00204
20 : 2130.28985702 0.0021
21 : 2126.15197018 0.00198
22 : 2100.87093677 0.00196
23 : 2100.6274171 0.002
24 : 2088.12451461 0.00198
25 : 2080.09991055 0.00206
26 : 2062.15391893 0.00196
27 : 2068.01374169 0.00198
28 : 2056.49964896 0.00186
29 : 2043.07786474 0.00168
30 : 2055.67890459 0.00184
31 : 2031.70050911 0.00176
32 : 2011.6799115 0.00168
33 : 2031.81337441 0.0017
34 : 2019.94121244 0.00166
35 : 2001.13926872 0.00156
36 : 2001.35871355 0.00144
37 : 1979.18396013 0.00152


In [175]:
V6,W6,loss6, misclass6 = (V_temp, W_temp, loss, misclass)

In [177]:
V_temp = V6
W_temp = W6
ind = nr.choice(50000,50000, replace = False)
loss = loss6
misclass = misclass6
for j in range(50):
    V_temp, W_temp = train_cross_entropy(V_temp, W_temp, tr_img, tr_lb, ind, j, j+1, 0.01 * 0.5*0.5*0.5)
    loss_temp = calculate_loss(V_temp, W_temp, tr_img, tr_lb, cross_entropy)
    mis_temp = misclassification(V_temp, W_temp, tr_img, tr_lb_num)
    print(j, ':', loss_temp, mis_temp)
    loss = np.append(loss, [loss_temp])
    misclass= np.append(misclass, [mis_temp])

0 : 1863.18855106 0.00142
1 : 1843.609186 0.00142
2 : 1841.00497505 0.00136
3 : 1839.06191727 0.00138
4 : 1839.90264808 0.00138
5 : 1837.09315191 0.00136
6 : 1831.52047536 0.00136
7 : 1824.48449766 0.00134
8 : 1818.8428198 0.0014
9 : 1813.37639853 0.00138
10 : 1811.2734474 0.0014
11 : 1808.59884281 0.0014
12 : 1809.87539449 0.0014
13 : 1805.65244716 0.0014
14 : 1798.58158101 0.00138
15 : 1794.25511009 0.00136
16 : 1788.72088345 0.00138
17 : 1786.21787384 0.00132
18 : 1782.86028796 0.00128
19 : 1779.26591601 0.00132
20 : 1779.15166205 0.00136
21 : 1778.38923872 0.00126
22 : 1773.21749301 0.00128
23 : 1770.26394077 0.00124
24 : 1768.51310962 0.0012
25 : 1767.07436332 0.0012
26 : 1761.91666461 0.00122
27 : 1757.76772587 0.00122
28 : 1755.04459447 0.00124
29 : 1755.38960523 0.00124
30 : 1751.24418704 0.00124
31 : 1744.63917305 0.0012
32 : 1740.77051948 0.00122
33 : 1734.822755 0.00124
34 : 1732.11887779 0.0012
35 : 1729.85687408 0.00126
36 : 1726.52989527 0.00124
37 : 1720.74226678 0.00128

In [203]:
misclassification(V7, W7, add_bias(vd_img), vd_lb_num)

0.036799999999999999

In [183]:
V7, W7, loss7, misclass7 = V_temp, W_temp, loss, misclass

In [204]:
V_temp = V7
W_temp = W7
ind = nr.choice(50000,50000, replace = False)
loss = loss7
misclass = misclass7
for j in range(50):
    V_temp, W_temp = train_cross_entropy(V_temp, W_temp, tr_img, tr_lb, ind, j, j+1, 0.01 * 0.5*0.5*0.5)
    loss_temp = calculate_loss(V_temp, W_temp, tr_img, tr_lb, cross_entropy)
    mis_temp = misclassification(V_temp, W_temp, tr_img, tr_lb_num)
    print(j, ':', loss_temp, mis_temp)
    loss = np.append(loss, [loss_temp])
    misclass= np.append(misclass, [mis_temp])

0 : 1673.39891164 0.00114
1 : 1673.73132343 0.00116
2 : 1668.38799214 0.00114
3 : 1666.20023313 0.00116
4 : 1666.98263246 0.00112
5 : 1672.47259482 0.00108
6 : 1668.82447881 0.0011
7 : 1670.05472661 0.00114
8 : 1664.67512736 0.00116
9 : 1659.12542647 0.00108
10 : 1656.01361218 0.00112
11 : 1654.16066225 0.00112
12 : 1648.90062829 0.00112
13 : 1647.0798692 0.00112
14 : 1644.17912072 0.00114
15 : 1641.87799619 0.00114
16 : 1639.5844865 0.00112
17 : 1640.82955209 0.00116
18 : 1638.42201023 0.00112
19 : 1638.56933917 0.00112
20 : 1636.88167151 0.00112
21 : 1634.62546983 0.00116
22 : 1631.8114599 0.00108
23 : 1628.68783636 0.00108
24 : 1622.27980808 0.00104
25 : 1619.75480139 0.00104
26 : 1618.64624012 0.0011
27 : 1616.2854415 0.00112
28 : 1619.19380151 0.0011
29 : 1614.36305919 0.00114
30 : 1609.25189574 0.00112
31 : 1606.04780794 0.00118
32 : 1604.62124476 0.00118
33 : 1602.70104739 0.0011
34 : 1597.00628172 0.0011
35 : 1593.12020112 0.00106
36 : 1591.28607763 0.00106
37 : 1586.17510919 0

In [206]:
V8, W8, loss8, misclass8 = V_temp, W_temp, loss, misclass
misclassification(V8, W8, add_bias(vd_img), vd_lb_num)

0.036799999999999999

In [209]:
misclassification(V5, W5, add_bias(vd_img), vd_lb_num)

0.036400000000000002

In [212]:
ts_img

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

#### Kaggle

In [215]:
pred_kag_cross = predict(V8, W8, add_bias(ts_img))
pred_num_cross = np.argmax(pred_kag_cross,axis=1)

In [221]:
pred_num_cross

array([4, 4, 2, ..., 4, 7, 2])

In [218]:
pred_cross = np.asarray([[i+1, pred_num_cross[i]] for i in np.arange(10000)])
np.savetxt('pred_cross.csv',pred_cross,fmt = '%1.u' , delimiter = ',', header = 'Id,Category',comments='')

In [219]:
ts_img.shape

(784, 10000)

In [220]:
pred_kag_cross

array([[  1.44160482e-03,   5.22016001e-06,   1.04723061e-03, ...,
          4.94551843e-05,   4.58721870e-02,   1.61482523e-03],
       [  6.82668517e-04,   2.43801874e-07,   6.31688775e-02, ...,
          1.16601674e-05,   9.39459726e-08,   4.43591382e-03],
       [  6.57341044e-04,   3.75681877e-06,   1.96056978e-02, ...,
          1.88533313e-06,   8.64890747e-06,   5.63097725e-07],
       ..., 
       [  3.35523814e-06,   3.97528479e-05,   8.69212672e-05, ...,
          5.34392809e-01,   8.24591765e-07,   1.32949370e-06],
       [  7.41625673e-03,   1.56300075e-05,   5.49985906e-05, ...,
          9.96520548e-01,   4.97298308e-04,   4.59268457e-02],
       [  9.72882343e-07,   8.73044788e-03,   7.67536801e-02, ...,
          5.25550347e-03,   3.83360121e-07,   1.04308173e-06]])