# Introduction

We will build a 2D convolutional neural network (CNN) class from scratch, implementing the algorithms with only minimal libraries such as NumPy.

We will also create pooling and flattening layers to complete the basic form of the CNN. The class should be named Scratch2dCNNClassifier.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def println(*args):
    # print each argument on its own line
    for arg in args:
        print(arg)

# Test Forward Propagation (problem 2)

## NOTE
- [Reference](http://d2l.ai/chapter_convolutional-neural-networks/channels.html)
- Multiple output channels just mean applying several different kernels, each of which gives one output channel.
- The kernel shape should be ($n_{out},n_{in},k_{height},k_{width}$)
- Summing over all the input channels gives one output channel
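
The notes above can be sketched as a small function. This is a minimal illustration, not the notebook's final implementation; the names `corr2d_multi`, `X_demo` and `K_demo` are hypothetical, and the kernel values are the same two kernels used in the test cell below.

```python
import numpy as np

def corr2d_multi(X, K, b=None):
    """Valid cross-correlation. X: (n_in, H, W), K: (n_out, n_in, kh, kw)."""
    n_out, n_in, kh, kw = K.shape
    out_h = X.shape[1] - kh + 1
    out_w = X.shape[2] - kw + 1
    b = np.zeros(n_out) if b is None else b
    Z = np.zeros((n_out, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = X[:, i:i + kh, j:j + kw]   # (n_in, kh, kw)
            # broadcast the window against every kernel, then sum over
            # the input channels and the kernel window: one value per kernel
            Z[:, i, j] = np.sum(window * K, axis=(1, 2, 3)) + b
    return Z

X_demo = np.arange(1, 17, dtype=float).reshape(1, 4, 4)
K_demo = np.zeros((2, 1, 3, 3))
K_demo[0, 0, 1, 1], K_demo[0, 0, 2, 1] = 1.0, -1.0   # center minus the cell below
K_demo[1, 0, 1, 1], K_demo[1, 0, 1, 2] = -1.0, 1.0   # right neighbour minus center
Z_demo = corr2d_multi(X_demo, K_demo)
print(Z_demo.shape)
print(Z_demo)
```

Because the window is broadcast against all kernels at once, the input does not have to be duplicated per output channel in this variant.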

In [3]:
# creating data, weight and bias
np.random.seed(0)
n_in = 1
n_out = 2
dim = (4,4)
kernel_size = (3,3)
print('n_in', n_in)
print('n_out', n_out)
print('dim',dim)
print('kernel_size', kernel_size)

# X = np.random.randint(0,10,(n_in, *dim))
X = np.array([[[ 1,  2,  3,  4],
                [ 5,  6,  7,  8],
                [ 9, 10, 11, 12],
                [13, 14, 15, 16]]])
println('X',X.shape,  X)

dup_needed = True
# W = np.random.randint(0,2,(n_out, n_in, *kernel_size)) #init kernel
W = np.array([[[[ 0.,  0.,  0.],
               [ 0.,  1.,  0.],
               [ 0., -1.,  0.]]],
              [[[ 0.,  0.,  0.],
               [ 0., -1.,  1.],
               [ 0.,  0.,  0.]]]])
B = np.random.randint(0,1, n_out)
println('W', W.shape, W)
println('B', B)

n_in 1
n_out 2
dim (4, 4)
kernel_size (3, 3)
X
(1, 4, 4)
[[[ 1  2  3  4]
  [ 5  6  7  8]
  [ 9 10 11 12]
  [13 14 15 16]]]
W
(2, 1, 3, 3)
[[[[ 0.  0.  0.]
   [ 0.  1.  0.]
   [ 0. -1.  0.]]]


 [[[ 0.  0.  0.]
   [ 0. -1.  1.]
   [ 0.  0.  0.]]]]
B
[0 0]


In [4]:
# duplicate X to match n_out
if dup_needed:
    X = np.vstack([[X]] * n_out)
    println('X', X.shape, X)

X
(2, 1, 4, 4)
[[[[ 1  2  3  4]
   [ 5  6  7  8]
   [ 9 10 11 12]
   [13 14 15 16]]]


 [[[ 1  2  3  4]
   [ 5  6  7  8]
   [ 9 10 11 12]
   [13 14 15 16]]]]


In [5]:
# forward

in_x, in_y = dim
ker_x, ker_y = kernel_size
conv_size_x = in_x - ker_x + 1
conv_size_y = in_y - ker_y + 1
output_shape = (conv_size_x, conv_size_y)
result = np.ones((n_out, *output_shape))
print('expected output shape: ', output_shape)
print('result shape: ', result.shape)
print('Xshape: ', X.shape)
for i in range(conv_size_x):
    for j in range(conv_size_y):
        print('convolving at: ', i,j)
        temp_x = X[:,:, i : i + ker_x, j : j + ker_y]
        # println('before: ', temp_x, temp_x.shape)
        temp = temp_x * W
        # print(1,temp)
        temp = np.sum(temp, axis = (2,3))
        # print(2, temp)
        temp = np.sum(temp, axis = 1) 
        # print(3,temp)
        result[:,i,j] = temp + B
        print(4,result[:,i,j] )

println('Forward', result.shape, result)

expected output shape:  (2, 2)
result shape:  (2, 2, 2)
Xshape:  (2, 1, 4, 4)
convolving at:  0 0
4 [-4.  1.]
convolving at:  0 1
4 [-4.  1.]
convolving at:  1 0
4 [-4.  1.]
convolving at:  1 1
4 [-4.  1.]
Forward
(2, 2, 2)
[[[-4. -4.]
  [-4. -4.]]

 [[ 1.  1.]
  [ 1.  1.]]]


# Test backward prop (problem 2)

## NOTE
- Basic backward prop [Ref](https://www.youtube.com/watch?v=i94OvYb6noo)
- Using the chain rule, plus a few clever tricks, we can find a neat way to backpropagate through a convolutional layer

### In ultra-simplified form
The formula for forward propagation is basically (super simplified):

Given that z is one cell of the output Z, $x_i$ are the cells of input X selected for one convolution step, $w_i$ are the kernel weights and b is the bias, the forward prop formula is as follows (for one output cell):
$$ z = y + b $$
$$ y = \sum_i x_i w_i $$

So in backward propagation, given $\frac{dL}{dZ}$, with $\frac{dL}{dz}$ being one cell of it:
$$\frac{dz}{dy} = \frac{dz}{db} = 1 \Rightarrow \frac{dL}{db} = \frac{dL}{dz}\frac{dz}{db} = \frac{dL}{dz}$$
Because the same b is shared by every output cell of a channel, the full bias gradient is the sum of $\frac{dL}{dz}$ over all cells of that channel.

Continuing, let $\sum$ denote the sum over all output cells z that a cell $x_i$ contributed to. So:
$$\frac{dL}{dx_i} = \sum \frac{dL}{dz}\frac{dz}{dy}\frac{dy}{dx_i}$$
$$ = \sum (\frac{dL}{dz} *1* w_i) $$ 
with $w_i$ being the weight that was multiplied with $x_i$ to get the corresponding y.

To put this in words, the gradient of some $x_i$ is the sum, over each output cell that $x_i$ contributed to, of that cell's gradient multiplied by the weight that $x_i$ was paired with in that cell.

So our formula is basically the right entries of W multiplied with the right entries of dZ. We just need to select the correct ones for each x. Here's an example of the gradient dX for a (3,3) input with a (2,2) kernel:

- **X:**
$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$

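- **dL/dx**, given the formula I described above. Here $y_k$ is actually the gradient of the k-th output cell (dy or dz; I write y for simplicity), and the kernel is $\begin{bmatrix} w_1 & w_2 \\ w_3 & w_4 \end{bmatrix}$

$$\begin{bmatrix}
y_1w_1 & y_1w_2 + y_2w_1 & y_2w_2 \\ 
y_1w_3 + y_3w_1 & y_1w_4 + y_2w_3 + y_3w_2 + y_4w_1 & y_2w_4 + y_4w_2 \\ 
y_3w_3 & y_3w_4 + y_4w_3 & y_4w_4 
\end{bmatrix}$$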

The way to get this output is to use a clever trick: **convolution + padding + transformed kernel**
- We rotate the kernel by 180 degrees
- And convolve it over the padded gradient dZ (or dY)
- In the previous example the padding is one (kernel size minus one); convolving then gives the expected result
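
The trick can be checked numerically. The sketch below uses hypothetical names (`corr2d`, `dx_direct`, `dx_trick`) and random stand-in values for the kernel and the upstream gradient. It compares a direct accumulation of the gradient with the padded, rotated-kernel convolution for a (3,3) input and a (2,2) kernel.

```python
import numpy as np

def corr2d(x, k):
    """Valid cross-correlation of a 2D array x with a 2D kernel k."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = np.arange(1.0, 10.0).reshape(3, 3)
w = rng.normal(size=(2, 2))
dz = rng.normal(size=(2, 2))   # stands in for dL/dZ

# Direct accumulation: every output cell sends dz * w back to the window it read.
dx_direct = np.zeros_like(x)
for i in range(2):
    for j in range(2):
        dx_direct[i:i + 2, j:j + 2] += dz[i, j] * w

# Trick: zero-pad dz by (kernel size - 1) and correlate with the kernel rotated by 180 degrees.
dx_trick = corr2d(np.pad(dz, 1), np.rot90(w, 2))

print(np.allclose(dx_direct, dx_trick))
```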

The result blew my mind

### So how about gradient of weight?

Remember the function?

$$ y = \sum_ix_i*w_i $$

Given that, the formula is basically the same, just with a different selection of x and z and a different output shape.

Similar to the trick above, we will
- Use the gradient of the output Z as a kernel
- Convolve it over the input X
- The result is the gradient of the kernel W (or K, whatever you call it)
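
For a single channel, this can be verified against a finite-difference gradient. The sketch below is an illustration under assumptions: the names `corr2d`, `loss` and `dw_num` are hypothetical, and the loss is taken to be `sum(dz * Z)` so that `dz` is exactly the upstream gradient.

```python
import numpy as np

def corr2d(x, k):
    """Valid cross-correlation of a 2D array x with a 2D kernel k."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = np.arange(1.0, 17.0).reshape(4, 4)
w = rng.normal(size=(3, 3))
dz = rng.normal(size=(2, 2))   # stands in for dL/dZ

# Weight gradient: use dz as the kernel and slide it over x.
dw = corr2d(x, dz)             # same shape as w: (3, 3)

def loss(w_):
    # scalar loss whose gradient with respect to Z is exactly dz
    return np.sum(dz * corr2d(x, w_))

# Central finite differences, one weight at a time.
eps = 1e-6
dw_num = np.zeros_like(w)
for a in range(3):
    for b in range(3):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[a, b] += eps
        w_minus[a, b] -= eps
        dw_num[a, b] = (loss(w_plus) - loss(w_minus)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-5))
```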

In [6]:
# sample dZ
dZ = np.array([[[ -4,  -4],
                   [ 10,  11]],
                  [[  1,  -7],
                   [  1, -11]]])
print('Grad dZ shape: ', dZ.shape)

Grad dZ shape:  (2, 2, 2)


In [7]:
# let's quickly define a convolve method, copying from the forward method
def convolve(X,K, B = None): #NOTE: INPUT FOR THIS METHOD MUST HAVE 4 dimensions (out, in, height, width)
    in_x, in_y = X.shape[-2], X.shape[-1]
    c_out, c_in, ker_x, ker_y = K.shape
    B = B if B is not None else np.zeros(c_out)
    x_out = in_x - ker_x + 1
    y_out = in_y - ker_y + 1
    output_shape = (x_out, y_out)
    result = np.ones((c_out, *output_shape))
    # print('result shape: ', result.shape)
    # print('in,out: ', c_in, c_out)
    # print('kshape: ', K.shape, 'Xshape: ', X.shape)
    for i in range(x_out):
        for j in range(y_out):
            temp_x = X[:,:, i : i + ker_x, j : j + ker_y]
            temp = temp_x * K
            temp = np.sum(temp, axis = (2,3))
            temp = np.sum(temp, axis = 1) 
            result[:,i,j] = temp + B

    # println('Final convolve result: ', result, result.shape)
    return result

def flip180(arr, axes = (-2,-1)):
    new_arr = np.rot90(arr,2, axes = axes)
    return new_arr
def padded(arr, pad_size = 1):
    return np.pad(arr, ((0,0),(0,0),(pad_size, pad_size),(pad_size, pad_size)), 'constant')
def swap_in_out_channels(arr, ax1 = 0, ax2 = 1):
    return np.swapaxes(arr, ax1, ax2)

In [8]:
# back prop

# remember to flip in/out channel of weights
W = swap_in_out_channels(W)

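# bias gradient (kept per cell here); the gradient of the per-channel bias B
# is dZ summed over the spatial axes, i.e. dZ.sum(axis=(1, 2))
dZ_dB = dZ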
print('dZ_dB: ', dZ_dB, dZ_dB.shape)

#z with added dimension
dZ = np.stack([dZ]*n_in, axis = 0)
# println('Dz: ', dZ, dZ.shape)
# x gradient
padded_dZ = padded(dZ, pad_size = 2)
dZ_dX = convolve(padded_dZ, flip180(W))
println('Dz_dX: ', dZ_dX, dZ_dX.shape)
# w gradient
dZ_dW = convolve(swap_in_out_channels(X), dZ)
println('dZ_dW:', dZ_dW, dZ_dW.shape)

dZ_dB:  [[[ -4  -4]
  [ 10  11]]

 [[  1  -7]
  [  1 -11]]] (2, 2, 2)
Dz_dX: 
[[[  0.   0.   0.   0.]
  [  0.  -5.   4.  -7.]
  [  0.  13.  27. -11.]
  [  0. -10. -11.   0.]]]
(1, 4, 4)
dZ_dW:
[[[30. 27. 24.]
  [18. 15. 12.]
  [ 6.  3.  0.]]]
(1, 3, 3)
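
Note that the `dZ_dW` printed above has shape (1, 3, 3) while `W` has shape (2, 1, 3, 3): using `dZ` as a kernel this way also sums over the two output channels. As a hedged sketch (the helper `conv_grads` is hypothetical and not part of the notebook), a per-kernel weight gradient and a per-channel bias gradient could be computed as follows, using the same `X` and `dZ` values as above:

```python
import numpy as np

def conv_grads(X, dZ, kernel_size):
    """X: (n_in, H, W), dZ: (n_out, out_h, out_w).
    Returns dW with shape (n_out, n_in, kh, kw) and dB with shape (n_out,)."""
    kh, kw = kernel_size
    n_out, out_h, out_w = dZ.shape
    dW = np.zeros((n_out, X.shape[0], kh, kw))
    for a in range(kh):
        for b in range(kw):
            patch = X[:, a:a + out_h, b:b + out_w]   # (n_in, out_h, out_w)
            # pair every output-channel gradient with every input channel
            dW[:, :, a, b] = np.einsum('ohw,ihw->oi', dZ, patch)
    dB = dZ.sum(axis=(1, 2))                         # one value per output channel
    return dW, dB

X_demo = np.arange(1.0, 17.0).reshape(1, 4, 4)
dZ_demo = np.array([[[-4., -4.], [10., 11.]],
                    [[1., -7.], [1., -11.]]])
dW_demo, dB_demo = conv_grads(X_demo, dZ_demo, (3, 3))
print(dW_demo.shape, dB_demo)
print(dW_demo.sum(axis=0))   # summing over output channels reproduces the (1, 3, 3) array above
```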


# Supporting Classes

In [9]:
from numpy.random import default_rng
rng = default_rng()

class SimpleInitializer:
    def __init__(self, sigma = 0.1):
        self.sigma = sigma
    
    def W(self, dimension = ()):
        return rng.normal(0, self.sigma, dimension)
    
    def B(self,n=0):
        return rng.normal(0, self.sigma, (1,n))


# Test
# simp_init = SimpleInitializer()
# w = simp_init.W(dimension = (3,2))
# b = simp_init.B(2)

# println('W', w)
# println('B', b)

# x = np.random.randint(5,size = (2,3))
# println('X', x)

# println('x*w + b', x@w + b)

In [10]:
class SGDOptimizer():
    def __init__(self, learning_rate = 0.01):
        self.learning_rate = learning_rate
    def update(self, layer, dW, dB):
        layer.W = layer.W - self.learning_rate * dW
        layer.B = layer.B - self.learning_rate * dB


# Problem 1
2D Convolutional Layer

In [11]:
# Fake supporting classes
class SimpleInitializer:
  def W(*arg): pass
  def B(*arg): pass

class SGD:
  def update(*arg): pass

In [12]:
class Conv2d:
    def __init__(self, 
        initializer = SimpleInitializer(), 
        optimizer = SGD(), 
        kernel_size = (3,3), 
        n_out_channels = 3, 
        padding = 0, 
        stride = 1):
        #! let's not consider padding and stride at the moment
        
        self.initializer = initializer
        self.optimizer = optimizer
        self.kernel_size = kernel_size
        self.padding = padding
        self.stride = stride
        self.n_out = n_out_channels
        self.skip_init_weight = False


    def forward(self, X):
        if X.ndim != 3:
            raise ValueError('Input must be 3-dimensional (input_channels, height, width)')
        
        # init size
        self.n_in, self.n_row, self.n_col = X.shape

        # init weight and biases (only on the first forward pass)
        if not self.skip_init_weight:
            self.W = self.initializer.W(dimension = (self.n_out, self.n_in, *self.kernel_size))
            self.B = self.initializer.B(self.n_out)
            self.skip_init_weight = True

        # add one duplicated dimension to X
        self.X = np.stack([X]*self.n_out)

        return self.convolve(self.X, self.W, self.B)

    def backward(self, dZ):
        if dZ.ndim != 3:
            raise ValueError('Input Gradient must be 3-dimensional (output_channels, height, width)')

        # remember to flip in/out channel of weights and X
        W = self.swap_in_out_channels(self.W)
        X = self.swap_in_out_channels(self.X)
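        # bias gradient: one value per output channel (sum over the spatial cells)
        dZ_dB = dZ.sum(axis = (1,2))
        #z with added dimension
        dZ = np.stack([dZ]*self.n_in, axis = 0)
        # w gradient
        # NOTE: using dZ as a kernel this way also sums over the output channels,
        # so the result has shape (n_in, k_height, k_width)
        dZ_dW = self.convolve(X, dZ)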

        self.optimizer.update(self, dZ_dW, dZ_dB)

        
        #z with pad
        pad_x, pad_y = self.kernel_size[0] - 1, self.kernel_size[1] - 1
        padded_dZ = self.padded(dZ, pad_x, pad_y)
        # x gradient
        dZ_dX = self.convolve(padded_dZ, self.flip180(W))
        return dZ_dX


    def convolve(self, X, W, B = None):
    #NOTE: INPUT FOR THIS METHOD MUST HAVE 4 dimensions (out, in, height, width)
        in_x, in_y = X.shape[-2], X.shape[-1]
        c_out, c_in, ker_x, ker_y = W.shape
        B = B if B is not None else np.zeros(c_out)
        x_out = in_x - ker_x + 1
        y_out = in_y - ker_y + 1
        output_shape = (x_out, y_out)
        result = np.ones((c_out, *output_shape))
        for i in range(x_out):
            for j in range(y_out):
                temp_x = X[:,:, i : i + ker_x, j : j + ker_y]
                temp = temp_x * W
                temp = np.sum(temp, axis = (2,3))
                temp = np.sum(temp, axis = 1) 
                result[:,i,j] = temp + B
        return result
    
    #! HELPERS

    def flip180(self, arr, axes = (-2,-1)):
        new_arr = np.rot90(arr,2, axes = axes)
        return new_arr
    def padded(self, arr, pad_x = 1, pad_y = 1):
        return np.pad(arr, ((0,0),(0,0),(pad_x, pad_x),(pad_y, pad_y)), 'constant')
    def swap_in_out_channels(self, arr, ax1 = 0, ax2 = 1):
        return np.swapaxes(arr, ax1, ax2)

In [13]:
# Test conv2D

cnn = Conv2d(kernel_size = (3,3), n_out_channels = 2)
cnn.skip_init_weight = True
cnn.W = W
cnn.B = B

X = np.array([[[ 1,  2,  3,  4],
                [ 5,  6,  7,  8],
                [ 9, 10, 11, 12],
                [13, 14, 15, 16]]])
cnn.W = np.array([[[[ 0.,  0.,  0.],
               [ 0.,  1.,  0.],
               [ 0., -1.,  0.]]],
              [[[ 0.,  0.,  0.],
               [ 0., -1.,  1.],
               [ 0.,  0.,  0.]]]])
cnn.B = np.array([0,0])
forward = cnn.forward(X)

# sample dZ
dZ = np.array([[[ -4,  -4],
                   [ 10,  11]],
                  [[  1,  -7],
                   [  1, -11]]])
backward = cnn.backward(dZ)

println('Forward (Z): ', forward.shape, forward)
println('Backward: (dZ/dX) ', backward.shape, backward)

Forward (Z): 
(2, 2, 2)
[[[-4. -4.]
  [-4. -4.]]

 [[ 1.  1.]
  [ 1.  1.]]]
Backward: (dZ/dX) 
(1, 4, 4)
[[[  0.   0.   0.   0.]
  [  0.  -5.   4.  -7.]
  [  0.  13.  27. -11.]
  [  0. -10. -11.   0.]]]


# Problem 3
Output size after 2-dimensional convolution

In [14]:
def conv_dim(dim, ker, pad, stride):
    dim, ker, pad, stride = np.array(dim), np.array(ker), np.array(pad), np.array(stride)
    result = (dim + 2 * pad - ker)/stride + 1
    return result.astype(np.int64)

dim = (4,4)
ker = (2,2)
pad = (2,2)
stride = (2,2)
print('Conv dim: ', conv_dim(dim, ker, pad, stride)) 


Conv dim:  [4 4]
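
For reference, `conv_dim` evaluates the usual output-size formula per axis. The symbols below are my own labels, not from the assignment: $N_{in}$ is the input size, $P$ the padding on each side, $F$ the kernel size and $S$ the stride.

$$ N_{out} = \left\lfloor \frac{N_{in} + 2P - F}{S} \right\rfloor + 1 $$

With the values used above ($N_{in} = 4$, $P = 2$, $F = 2$, $S = 2$):

$$ N_{out} = \left\lfloor \frac{4 + 2 \cdot 2 - 2}{2} \right\rfloor + 1 = 3 + 1 = 4 $$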


# Problem 4
Creation of max pooling layer

In [15]:
from numpy import unravel_index
class MaxPool2D():
    def __init__(self, pool_size = (2,2), padding = (0,0), stride = (1,1)):
        self.pool_size = pool_size
        self.padding = padding
        self.stride = stride

    def forward(self, X):
        if X.ndim != 3: raise ValueError('Invalid dimension, must be 3 (in, height, width)')

        self.dim = X.shape[1:]
        self.out_x, self.out_y = conv_dim(self.dim, self.pool_size, self.padding, self.stride)

        output = []
        self.max_indexes_array = []
        for X_channel in X:
            pool_result, max_indexes = self.pool_channel(X_channel)
            output.append(pool_result)
            self.max_indexes_array.append(max_indexes)
        return np.stack(output, 0)
    
    def pool_channel(self, X):
        output = np.zeros((self.out_x, self.out_y))
        x_pool, y_pool = self.pool_size
        max_indexes = []
        for i in range(self.out_x):
            for j in range(self.out_y):
                temp_x = X[i : i + x_pool,j : j + y_pool]

                kernel_max_index = unravel_index(temp_x.argmax(), temp_x.shape)
                max_index = (kernel_max_index[0] + i, kernel_max_index[1] + j)
                output[i,j] = X[max_index]
                max_indexes.append(max_index)
        return output, max_indexes

    def backward(self, dZ): 
        dX = []
        for channel, dz in enumerate(dZ):
            dx = np.zeros(self.dim)
            for i, gradient in enumerate(dz.flatten()):
                max_idx = self.max_indexes_array[channel][i]
                # accumulate: overlapping windows can share the same max cell
                dx[max_idx] += gradient
            dX.append(dx)
        return np.stack(dX, 0) 




In [16]:
# test max pool
X = np.arange(27)
X = X.reshape((3,3,3))
println('X', X)
pool_layer = MaxPool2D()
forward = pool_layer.forward(X)
println('Pool forward: ', forward.shape, forward)

dZ = np.arange(12)
dZ = dZ.reshape((3,2,2))
println('dZ', dZ)
backward = pool_layer.backward(dZ)
println('Pool backward: ', backward.shape, backward)

X
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]
Pool forward: 
(3, 2, 2)
[[[ 4.  5.]
  [ 7.  8.]]

 [[13. 14.]
  [16. 17.]]

 [[22. 23.]
  [25. 26.]]]
dZ
[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]
Pool backward: 
(3, 3, 3)
[[[ 0.  0.  0.]
  [ 0.  0.  1.]
  [ 0.  2.  3.]]

 [[ 0.  0.  0.]
  [ 0.  4.  5.]
  [ 0.  6.  7.]]

 [[ 0.  0.  0.]
  [ 0.  8.  9.]
  [ 0. 10. 11.]]]
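
With a (2,2) pool and stride (1,1) the windows overlap, so a single input cell can be the maximum of several windows. In that case its gradient should be the sum of the gradients of all those windows. The test above does not exercise this case, so here is a small sketch of it (the function `maxpool_backward_demo` and its inputs are hypothetical, single-channel, stride 1 only):

```python
import numpy as np

def maxpool_backward_demo(x, dz, pool=2):
    """Stride-1 max pooling backward for one 2D channel, accumulating gradients."""
    dx = np.zeros_like(x, dtype=float)
    for i in range(dz.shape[0]):
        for j in range(dz.shape[1]):
            window = x[i:i + pool, j:j + pool]
            r, c = np.unravel_index(window.argmax(), window.shape)
            dx[i + r, j + c] += dz[i, j]   # add, do not overwrite
    return dx

# The center cell is the maximum of all four overlapping windows.
x_demo = np.array([[0., 0., 0.],
                   [0., 9., 0.],
                   [0., 0., 0.]])
dz_demo = np.ones((2, 2))
dx_demo = maxpool_backward_demo(x_demo, dz_demo)
print(dx_demo)
```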


# Problem 5 
Average Pooling layer

In [17]:
class AveragePool2D():
    def __init__(self, pool_size = (2,2), padding = (0,0), stride = (1,1)):
        self.pool_size = pool_size
        self.padding = padding
        self.stride = stride

    def forward(self, X):
        if X.ndim != 3: raise ValueError('Invalid dimension, must be 3 (in, height, width)')

        self.dim = X.shape[1:]
        out_dim = conv_dim(self.dim, self.pool_size, self.padding, self.stride)
        
        return self.pool_full(X, out_dim)
        


    def pool_full(self, X, out_dim):
        output = []
        for X_channel in X:
            pool_result = self.pool_channel(X_channel, out_dim)
            output.append(pool_result)
        return np.stack(output, 0)
    
    def pool_channel(self, X, out_dim):
        out_x, out_y = out_dim
        output = np.zeros((out_x, out_y))
        x_pool, y_pool = self.pool_size
        for i in range(out_x):
            for j in range(out_y):
                temp_x = X[i : i + x_pool, j : j + y_pool]
                output[i,j] = temp_x.mean()
        return output

    def backward(self, dZ): 
        pad_x, pad_y = np.array(self.pool_size) - 1
        dZ = self.padded(dZ, pad_x, pad_y)

        return self.pool_full(dZ, self.dim)
    
    def padded(self, arr, pad_x = 1, pad_y = 1):
        return np.pad(arr, ((0,0),(pad_x, pad_x),(pad_y, pad_y)), 'constant')




In [18]:
# test mean pool
X = np.arange(27)
X = X.reshape((3,3,3))
println('X', X)
pool_layer = AveragePool2D()
forward = pool_layer.forward(X)
# println('Pool forward: ', forward.shape, forward)

dZ = np.arange(12)
dZ = dZ.reshape((3,2,2))
println('dZ', dZ)
backward = pool_layer.backward(dZ)
println('Pool backward: ', backward.shape, backward)

X
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]
dZ
[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]
Pool backward: 
(3, 3, 3)
[[[0.   0.25 0.25]
  [0.5  1.5  1.  ]
  [0.5  1.25 0.75]]

 [[1.   2.25 1.25]
  [2.5  5.5  3.  ]
  [1.5  3.25 1.75]]

 [[2.   4.25 2.25]
  [4.5  9.5  5.  ]
  [2.5  5.25 2.75]]]
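
As a cross-check of the first channel of the result above, the gradient can also be computed directly: each output cell spreads its gradient evenly over the window it averaged. This is a sketch for stride 1 only, and `avgpool_backward_direct` is a hypothetical helper, not part of the class above.

```python
import numpy as np

def avgpool_backward_direct(dz, in_shape, pool=(2, 2)):
    """Stride-1 average pooling backward for one 2D channel."""
    ph, pw = pool
    dx = np.zeros(in_shape)
    for i in range(dz.shape[0]):
        for j in range(dz.shape[1]):
            # every cell of the window received weight 1 / (ph * pw) in the forward pass
            dx[i:i + ph, j:j + pw] += dz[i, j] / (ph * pw)
    return dx

dz_demo = np.arange(4, dtype=float).reshape(2, 2)   # first channel of dZ above
dx_demo = avgpool_backward_direct(dz_demo, (3, 3))
print(dx_demo)
```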


# Problem 6
Flattening

In [19]:
class Flatten():
  def __init__(self): pass
  def forward(self, X):
    self.input_shape = X.shape
    return X.flatten()
  def backward(self,dZ):
    return dZ.reshape(self.input_shape)

# test flatten
X = np.arange(27)
X = X.reshape((3,3,3))
println('X', X)
flatten_layer = Flatten()
forward = flatten_layer.forward(X)
# println('Pool forward: ', forward.shape, forward)

dZ = X
println('dZ', dZ)
backward = flatten_layer.backward(dZ)
println('Pool backward: ', backward.shape, backward)

X
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]
dZ
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]
Pool backward: 
(3, 3, 3)
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]


# Problem 7
Learning and Estimation

## Supporting Classes

In [23]:
# Import layers from previous tasks

from deep_neural import SimpleInitializer
from deep_neural import SGD
from deep_neural import FC
from deep_neural import GetMiniBatch
from deep_neural import Sigmoid
from deep_neural import SoftMax
from deep_neural import Tanh
from deep_neural import DeepNeuralNetworkClassifier

# Data Set

In [24]:
#data set
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# #reshape
# X_train = X_train.reshape(-1, 784)
# X_test = X_test.reshape(-1, 784)
#scaling
X_train = X_train.astype(np.float64)  # the np.float alias was removed in NumPy 1.24
X_test = X_test.astype(np.float64)
X_train /= 255
X_test /= 255
#one hot encode for multiclass labels!
from sklearn.preprocessing import OneHotEncoder
# NOTE: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
y_train_one_hot = enc.fit_transform(y_train[:, np.newaxis])
y_test_one_hot = enc.transform(y_test[:, np.newaxis])
#validation split
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train_one_hot, train_size=0.5)

In [25]:
# test run deep neural
model = DeepNeuralNetworkClassifier(enc, debug = False, verbose = True, max_iter = 5)
n_in_features = X_train.shape[1]
batch_size  = 20
print('train shape: ', X_train.shape)
print('input features: ', n_in_features)
print('input channels: ', batch_size)


in_channel = X_train.shape[1]

l1 = Conv1DBatch(filter_size = 3, n_input = in_channel, n_output = 1, optimizer = AdaGrad(0.1))
model.add(l1,Tanh())
lshape = FlattenLayer()
model.add(lshape,TransparentFunction())
l2 = FC(26,100, SimpleInitializer(),SGD()) #cause output of conv is 394
model.add(l2,Tanh())
l3 = FC(100,10, SimpleInitializer(),SGD())
model.add(l3,SoftMax())

X shape:  (30000, 28, 28) type:  float64
Batch count:  1500
Layer 1:  28 400
Activ: 1: Tanh
Layer 2:  400 200
Activ: 2: Tanh
Layer 3:  200 10
Activ: 3: SoftMax
Epoch:  0


ValueError: operands could not be broadcast together with shapes (20,28,10) (200,1) 
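
Shape errors like the one above are easier to avoid when the input size of each fully connected layer is derived from the output-size formula instead of typed by hand. The sketch below traces shapes for a hypothetical stack (one 3x3 convolution with 3 output channels, then 2x2 max pooling with stride 2, then flatten) on a single (1, 28, 28) MNIST image. The stack and the variable names are assumptions for illustration, not the model defined in the cell above.

```python
import numpy as np

def conv_dim(dim, ker, pad, stride):
    # same formula as in Problem 3, written with floor division
    dim, ker, pad, stride = map(np.array, (dim, ker, pad, stride))
    return ((dim + 2 * pad - ker) // stride + 1).astype(np.int64)

n_out_channels = 3                                          # hypothetical number of kernels
after_conv = conv_dim((28, 28), (3, 3), (0, 0), (1, 1))     # 3x3 kernel, no padding, stride 1
after_pool = conv_dim(after_conv, (2, 2), (0, 0), (2, 2))   # 2x2 pool, stride 2
n_flat = int(n_out_channels * after_pool.prod())            # input size of the first FC layer
print(after_conv, after_pool, n_flat)
```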