## Convolutional Neural Network

Every image is a matrix of pixel values. A black-and-white (grayscale) image is represented as a single 2-dimensional matrix, for example 255x255. However, with coloured images, particularly RGB (Red, Green, Blue)-based images, the presence of separate colour channels (3 in the case of RGB images) introduces an additional ‘depth’ field to the data, making the input 3-dimensional. Hence, for a given RGB image of size, say, 255×255 (Width x Height) pixels, we’ll have 3 matrices associated with each image, one for each of the colour channels.
Thus the image, in its entirety, constitutes a 3-dimensional structure called the Input Volume (255x255x3).
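
As a minimal sketch of the idea above, the two kinds of image can be modelled as NumPy arrays. The array names and the all-zero pixel values are placeholders chosen for illustration; only the shapes matter here.

```python
import numpy as np

# Hypothetical image size, matching the 255x255 example in the text.
height, width = 255, 255

# A grayscale image is a single 2-D matrix of pixel intensities.
gray_image = np.zeros((height, width), dtype=np.uint8)

# An RGB image stacks three such matrices, one per colour channel.
rgb_image = np.zeros((height, width, 3), dtype=np.uint8)

print(gray_image.shape)  # (255, 255)
print(rgb_image.shape)   # (255, 255, 3)

# Each colour channel can be sliced out as its own 2-D matrix.
red_channel = rgb_image[:, :, 0]
print(red_channel.shape)  # (255, 255)
```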

Computers learn about an image from its pixel values. CNNs are biologically inspired models: they loosely mimic the way mammals visually perceive the world around them using a layered architecture of neurons in the brain. CNNs allow computers to see; in other words, CNNs are used to recognize images by feeding images to the system and letting it learn from them.

A CNN has two parts: feature learning (Conv, ReLU, and Pool layers) and classification (FC and softmax layers).

### Some important definitions:

#### Filter:
A filter (also called a kernel or feature detector) is a small matrix used for feature detection. A typical filter on the first layer of a ConvNet might have a size of [5x5x3].


#### Activation Map:
The convolved feature, activation map, or feature map is the output volume formed by sliding the filter over the image and computing the dot product at each position.

#### Receptive field:
The receptive field is a local region of the input volume that has the same size as the filter.

#### Stride:
Stride is the number of pixels by which the filter shifts at each step; larger strides produce spatially smaller output volumes. For example, with stride=2, the filter shifts by 2 pixels at a time as it convolves around the input volume. Normally, we set the stride so that the output dimensions are integers and not fractions. Common strides: 1 or 2 (smaller strides work better in practice); uncommon strides: 3 or more.
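
The relationship between stride and output size can be sketched with the standard formula (W − F + 2P) / S + 1, where W is the input size, F the filter size, P the padding, and S the stride. The helper name `conv_output_size` is hypothetical and not part of the notebook.

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span % stride != 0:
        # The filter does not tile the input evenly with this stride.
        raise ValueError("stride gives a fractional output size")
    return span // stride + 1

print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
```

With a 7-pixel input and a 3-pixel filter, a stride of 3 would raise the error, which is the "fractional output" case the text warns about.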

#### Padding:
Zero-padding adds zeros around the border of the input volume so that the convolution output has the same spatial size as the input. Without padding, each Conv layer shrinks the volume and information at the borders is lost, which can hurt performance.
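
A small illustration of zero-padding, using NumPy's `np.pad` on a made-up 3x3 input. With a 3x3 filter and stride 1, one ring of zeros is enough to keep the output at 3x3, since (3 − 3 + 2·1) / 1 + 1 = 3.

```python
import numpy as np

# Toy 3x3 input (values are arbitrary).
x = np.arange(1, 10).reshape(3, 3)

# Add one ring of zeros around the border.
padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)  # (5, 5)
print(padded)
```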

### Convolution:

1. A convolution is an orderly procedure where two sources of information are intertwined.

2. Kernels are then convolved with the input volume to obtain so-called ‘activation maps’ (also called feature maps).

3. The values in the kernel matrix change with each learning iteration over the training set, indicating that the network is learning to identify which regions are of significance for extracting features from the data.

4. We compute the dot product between the kernel and the patch of the input matrix it currently covers. The convolved value, obtained by summing the element-wise products, forms a single entry in the activation map.

5. The patch selection is then slid (towards the right, or downwards when the boundary of the matrix is reached) by a certain amount called the ‘stride’ value, and the process is repeated until the entire input image has been processed. The process is carried out for all colour channels.

6. Instead of connecting each neuron to all possible pixels, we specify a 2-dimensional region called the ‘receptive field’ (say, of size 5×5 units) extending through the entire depth of the input (5x5x3 for a 3-colour-channel input); only the pixels within this region are connected to the neuron. It is over these small regions that the network layer cross-sections, each consisting of several neurons (called ‘depth columns’), operate and produce the activation map. This local connectivity reduces computational complexity.
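
The sliding-window procedure in steps 4 and 5 can be sketched in plain NumPy for a single channel. This is an illustrative, unoptimised version and not how TensorFlow implements it; the function name `convolve2d_valid` and the toy image and kernel are assumptions. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return the activation map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            # Patch of the input currently covered by the kernel.
            patch = image[r * stride:r * stride + kh,
                          c * stride:c * stride + kw]
            # Sum of element-wise products gives one activation-map entry.
            out[r, c] = np.sum(patch * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

activation_map = convolve2d_valid(image, kernel)
print(activation_map.shape)  # (3, 3)
print(activation_map)        # every entry is -5.0 for this image and kernel
```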

### Pooling:

The Pool layer reduces the spatial dimensions of the input and, with them, the computational complexity of the model. It also helps control overfitting. It operates independently on every depth slice of the input. Several pooling functions exist, such as max pooling, average pooling, and L2-norm pooling. Max pooling is the most widely used: it keeps only the largest value (the strongest activation, or the brightest pixel) in each pooling window.
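
A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single depth slice. The helper name `max_pool_2x2_np` and the input values are made up for illustration, and the reshape trick assumes the height and width are even.

```python
import numpy as np

def max_pool_2x2_np(x):
    """2x2 max pooling with stride 2 on one 2-D depth slice."""
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Group pixels into non-overlapping 2x2 windows, then take each window's max.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [1, 8, 3, 4]])

pooled = max_pool_2x2_np(x)
print(pooled)
# [[6 4]
#  [8 9]]
```

Replacing `.max` with `.mean` gives average pooling over the same windows.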

### Activation layer: 

Major types of activation functions:

1. ReLU: ReLU Layer applies an elementwise activation function max(0,x), which turns negative values to zeros (thresholding at zero). This layer does not change the size of the volume and there are no hyperparameters.

2. Tanh: squashes its input into the range (-1, 1).

3. Sigmoid: squashes its input into the range (0, 1).
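
The three activation functions can be compared on a few sample inputs. This is a NumPy sketch for illustration; the helper names `relu` and `sigmoid` are assumptions, and the notebook itself uses TensorFlow's `tf.nn.relu` and `tf.nn.tanh`.

```python
import numpy as np

def relu(x):
    # Thresholds at zero: negative values become 0.
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])

print(relu(x))     # [0. 0. 2.]
print(np.tanh(x))  # approximately [-0.964, 0, 0.964]
print(sigmoid(x))  # approximately [0.119, 0.5, 0.881]
```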

### Regularization:

Dropout forces an artificial neural network to learn multiple independent representations of the same data by randomly disabling neurons during the learning phase. It is a widely used regularization technique in neural network implementations. To perform dropout on a layer, you randomly set some of the layer's values to 0 during forward propagation.
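
A minimal sketch of dropout in NumPy, assuming the "inverted" variant in which the kept values are scaled by 1 / keep_prob so that the expected activation is unchanged (TensorFlow's `tf.nn.dropout` also rescales this way). The function name `dropout` and the all-ones input are illustrative.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Randomly zero activations; scale the survivors by 1 / keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(1000)

dropped = dropout(activations, keep_prob=0.5, rng=rng)

# Each value is now either 0 (dropped) or 2 (kept and rescaled).
print(np.unique(dropped))
# The mean stays close to the original mean of 1.
print(dropped.mean())
```

At evaluation time dropout is switched off, which corresponds to `keep_prob = 1.0`.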

### Dataset used: 

I have used a simple CSV file of images with 10 labels covering various accessories, clothing, and footwear as the input. Here, I tweak various factors to understand their effect while tracking the accuracy of the model.

The data used here can be found on Kaggle:
https://www.kaggle.com/zalando-research/fashionmnist/data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
import tensorflow as tf
import numpy as np
import warnings 



<br><br>We first divide the dataset into data points and labels. A label identifies what a data point shows: whether an image is of a shoe, a t-shirt, pants, etc. The data points represent the pixel values of each image. Since the images are 28x28, we have 784 data points for each image. The LabelBinarizer converts each image label into a one-hot vector with one position per class, and the convolutional network gives us a probability for each of those positions. <br><br>
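
To make the one-hot encoding concrete, here is a NumPy sketch that, for integer labels 0 to 9, should produce the same kind of matrix `LabelBinarizer` does when all ten classes are present. The sample labels are made up.

```python
import numpy as np

# Hypothetical class labels for four images (10 classes, numbered 0..9).
sample_labels = np.array([0, 3, 9, 3])
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=int)[sample_labels]

print(one_hot.shape)  # (4, 10)
print(one_hot[1])     # [0 0 0 1 0 0 0 0 0 0]
```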

In [5]:
data = pd.read_csv('fashion_train.csv')
images_data = data.iloc[:,1:].values
train_labels = data.iloc[:,0].values
labels = LabelBinarizer().fit_transform(train_labels)

<br><br><b>Placeholder</b><br>
A placeholder is simply a variable that we will assign data to later. It allows us to create our operations and build our computation graph without needing the data. In TensorFlow terminology, we then feed data into the graph through these placeholders.
<br><br>

In [6]:
sess = tf.InteractiveSession()
x = tf.placeholder("float", shape=[None, 784])
y_ = tf.placeholder("float", shape=[None, 10])


<br><br>Next, we define the weight, bias, convolution, and pooling helpers for the network. From here on, the construction of the neural network is explained by the comments at each step.<br><br>

In [51]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev = 0.007)
    return tf.Variable(initial)
    
def bias_variable(shape):
    initial = tf.constant(0.0008, shape = shape)
    return tf.Variable(initial)
    
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

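def max_pool_2x2(x):
    # NOTE: despite the function name, this applies 2x2 *average* pooling (tf.nn.avg_pool).
    # Use tf.nn.max_pool with the same arguments for max pooling.
    return tf.nn.avg_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')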

epochs_completed = 0
index_in_epoch = 0
num_examples = images_data.shape[0]

In [52]:
#Weights and bias for the first convolution layer
W_conv1 = weight_variable([5, 5, 1, 32]) #5*5 matrix for 32 features and depth 1
b_conv1 = bias_variable([32])
x_image = tf.reshape(x, [-1,28,28,1])


#First convolution
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])#5*5 matrix for 64 features and depth 32
b_conv2 = bias_variable([64])


#second convolution
h_conv2 = tf.nn.tanh(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])


#densely connected layer, where we allow all neurons to merge and process entire image
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.tanh(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

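y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
# Cross-entropy summed over the batch. tf.log(y_conv) can yield NaN if a probability reaches 0;
# tf.nn.softmax_cross_entropy_with_logits is a numerically safer alternative.
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))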
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
sess.run(tf.global_variables_initializer())
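
The `7 * 7 * 64` used for `W_fc1` follows from how the spatial size changes through the network. The sketch below traces it, assuming (as in the cell above) that the 'SAME' convolutions with stride 1 keep the size and each 2x2 pooling with stride 2 and 'SAME' padding yields ceil(size / 2). The helper name `same_pool_out` is hypothetical.

```python
def same_pool_out(size, stride=2):
    """Output size of a 'SAME'-padded pooling step: ceil(size / stride)."""
    return -(-size // stride)

size = 28                   # input images are 28x28
size = same_pool_out(size)  # after conv1 (size unchanged) and the first pool
print(size)                 # 14
size = same_pool_out(size)  # after conv2 (size unchanged) and the second pool
print(size)                 # 7

flat_features = size * size * 64  # 64 feature maps from the second convolution
print(flat_features)              # 3136
```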

In [53]:
BatchSize = 150
def generate_batch(images_data, labels, batch_size):
    # Sample batch_size random row indexes (with replacement); randint's upper bound is exclusive
    batch_indexes = np.random.randint(0, len(images_data), batch_size)
    batch_features = images_data[batch_indexes]
    batch_labels = labels[batch_indexes]
    
    return (batch_features, batch_labels)


In [54]:
#split the data into training (first 88%) and validation (remaining 12%)
train_samples = int( len(images_data) / (1 / 0.88))

train_features = images_data[: train_samples]
train_labels   = labels[: train_samples]

validation_features = images_data[train_samples: ]
validation_labels = labels[train_samples: ]

In [55]:
TrainingStep = 500
accuracy_history = []
for i in range(TrainingStep):
    
    batch_features, batch_labels = generate_batch(train_features, train_labels, BatchSize)
    
    if i%200 == 0:
        accuracy_ = sess.run( accuracy, feed_dict = {x : validation_features, y_: validation_labels, keep_prob:1.0})
        accuracy_history.append(accuracy_)
        print("step  %i  and validation acc :%g "%(i, accuracy_))

    sess.run(train_step, feed_dict = { x: batch_features, y_: batch_labels, keep_prob:0.5})



step  0  and validation acc :0.0437439 
step  200  and validation acc :0.81086 
step  400  and validation acc :0.84766 


In [56]:
data_test = pd.read_csv('fashion_test.csv')
images_data_test = data_test.iloc[:,1:].values
test_labels = data_test.iloc[:,0].values
t_labels = LabelBinarizer().fit_transform(test_labels)

In [57]:
print ("test accuracy %g"%accuracy.eval(feed_dict={
    x: images_data_test, y_: t_labels, keep_prob: 1.0}))


test accuracy 0.8429
