# L3 – Convolutional neural network

### Materials
1. [ImageNet](http://www.image-net.org)
2. [Overview](https://en.wikipedia.org/wiki/Convolutional_neural_network#Pooling_layer) on wiki.
3. Stanford's [course](http://cs231n.stanford.edu) on convolutional networks + some [materials](http://cs231n.github.io/convolutional-networks/) on github.
4. [Pooling](https://arxiv.org/pdf/1412.6806.pdf)
5. [Dropout](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
6. [Batch normalization](https://arxiv.org/pdf/1502.03167.pdf)
7. [Data augmentation](http://cs231n.stanford.edu/reports/2017/pdfs/300.pdf)

### Models
1. [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)
2. [AlexNet](http://papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012)
3. [VGGNet](https://arxiv.org/pdf/1409.1556.pdf)
4. [GoogLeNet](https://arxiv.org/pdf/1409.4842)
5. [ResNet](https://arxiv.org/pdf/1512.03385.pdf)

### Tutorials
1. [Guide](https://www.tensorflow.org/tutorials/layers) to conv nets training
2. More advanced [tutorial](https://www.tensorflow.org/tutorials/deep_cnn)
3. How to [use GPUs](https://www.tensorflow.org/tutorials/using_gpu)

### 1. Convolution
The convolutional layer is the core building block of a conv net.

#### Overview and intuition
The conv layer’s parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of some conv net might have size $3 \times 3 \times 3$ (i.e. 3 pixels width and height, and 3 because images have depth 3, RGB channels). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an entire set of filters in each conv layer, and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.

#### Local connectivity
When dealing with high-dimensional inputs such as images, it is impractical to connect each neuron to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently this is the filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: the connections are local in space (along width and height), but always full along the entire depth of the input volume.

#### Spatial arrangement
We have explained the connectivity of each neuron in the conv layer to the input volume, but we have not yet discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the depth, stride and zero-padding.

1. First, the **depth** of the output volume. It corresponds to the number of filters we would like to use, each learning to look for something different in the input. For example, if the first convolutional Layer takes as input the raw image, then different neurons along the depth dimension may activate in presence of various oriented edges, or blobs of color. We will refer to a set of neurons that are all looking at the same region of the input as a depth column.

2. Second, we must specify the **stride** with which we slide the filter. When the stride is $1$ then we move the filters one pixel at a time. When the stride is $2$ (or, rarely in practice, $3$ or more) then the filters jump $2$ pixels at a time as we slide them around. This will produce smaller output volumes spatially.

3. Third, it is sometimes convenient to pad the input volume with zeros around the border. The size of this **zero-padding** is a hyperparameter. The nice feature of zero-padding is that it allows us to control the spatial size of the output volumes (most commonly, to exactly preserve the spatial size of the input volume so that the input and output width and height are the same).
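
The three hyperparameters above determine the spatial output size through the standard formula $(W - F + 2P)/S + 1$, where $W$ is the input size, $F$ the filter size, $P$ the padding and $S$ the stride. A minimal sketch follows; the helper name `conv_output_size` is made up for this illustration.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a conv layer along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        # Assumption of this sketch: reject settings where the filter
        # does not tile the padded input evenly.
        raise ValueError("filter does not fit the padded input evenly")
    return span // stride + 1


print(conv_output_size(32, 3, stride=1, padding=1))  # 32: size is preserved
print(conv_output_size(32, 5, stride=1, padding=0))  # 28
print(conv_output_size(7, 3, stride=2, padding=0))   # 3
```

Raising an error on uneven settings is a choice made here for clarity; many frameworks round down instead.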

#### Implementation as matrix multiplication
Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the conv layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply. This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times. However, the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used [BLAS API](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms)).
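
The idea can be sketched in NumPy: every local region is unrolled into one row of a matrix (often called im2col), after which the whole layer is a single matrix product. This is an illustrative sketch without padding, not how any particular library implements it; the function names and the `(H, W, C)` layout are assumptions of the example.

```python
import numpy as np


def im2col(x, f, stride=1):
    """Unroll every f x f x C patch of x (shape H, W, C) into one row."""
    h, w, c = x.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    rows = []
    for i in range(0, out_h * stride, stride):
        for j in range(0, out_w * stride, stride):
            rows.append(x[i:i + f, j:j + f, :].ravel())
    return np.stack(rows), out_h, out_w


def conv_forward(x, filters, stride=1):
    """filters has shape (K, f, f, C); the result has shape (out_h, out_w, K)."""
    k, f = filters.shape[0], filters.shape[1]
    cols, out_h, out_w = im2col(x, f, stride)   # (out_h * out_w, f * f * C)
    w_mat = filters.reshape(k, -1).T            # (f * f * C, K)
    return (cols @ w_mat).reshape(out_h, out_w, k)


rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 3))
filters = rng.standard_normal((4, 3, 3, 3))
out = conv_forward(x, filters)

# Compare one output value with a direct dot product over the same patch.
direct = np.sum(x[1:4, 2:5, :] * filters[2])
print(out.shape, np.isclose(out[1, 2, 2], direct))
```

Note how overlapping patches copy the same input values into several rows of `cols`; this is the memory overhead mentioned above.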

#### Backpropagation
The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters). This is easy to derive in the 1-dimensional case with a toy example.
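
A toy 1-dimensional check of this statement, written as a sketch in NumPy. The forward pass is the sliding dot product used in conv nets (no filter flip), the gradient with respect to the input is a full convolution with the flipped filter, and both gradients are compared against finite differences. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(6)    # input signal
w = rng.standard_normal(3)    # filter
dy = rng.standard_normal(4)   # upstream gradient dL/dy, length 6 - 3 + 1


def forward(x, w):
    # Conv-net style "convolution": y[i] = sum_k x[i + k] * w[k]
    return np.correlate(x, w, mode="valid")


# Analytic gradients. np.convolve flips w, which the forward pass did not do.
dx = np.convolve(dy, w, mode="full")      # dL/dx, same length as x
dw = np.correlate(x, dy, mode="valid")    # dL/dw, same length as w


def numeric_grad(f, v, eps=1e-6):
    """Central finite differences of the scalar function f at v."""
    g = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        g[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return g


# Scalar loss L = <y, dy>, so that dL/dy equals dy by construction.
dx_num = numeric_grad(lambda v: np.dot(forward(v, w), dy), x)
dw_num = numeric_grad(lambda v: np.dot(forward(x, v), dy), w)
print(np.allclose(dx, dx_num), np.allclose(dw, dw_num))
```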

#### Convolution $1 \times 1$
Several papers use $1 \times 1$ convolutions. Some people are at first confused to see $1 \times 1$ convolutions especially when they come from signal processing background. Normally signals are 2-dimensional so $1 \times 1$ convolutions do not make sense (it’s just pointwise scaling). However, we must remember that we operate over 3-dimensional volumes, and that the filters always extend through the full depth of the input volume. For example, if the input is $n \times n \times 3$ then doing $1 \times 1$ convolutions would effectively be doing 3-dimensional dot products (since the input depth is 3 channels).
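
In other words, a $1 \times 1$ convolution applies the same small linear map to the channel vector at every pixel. A minimal NumPy sketch, assuming an `(n, n, channels)` layout and 8 output filters chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3))   # input volume: n x n x 3
w = rng.standard_normal((3, 8))      # 8 filters of size 1 x 1 x 3, one per column

# Matrix product over the last axis: every pixel's 3 channels map to 8 outputs.
out = x @ w                          # shape (4, 4, 8)

# The value at one pixel is just 8 three-dimensional dot products.
pixel = w.T @ x[2, 1]
print(out.shape, np.allclose(out[2, 1], pixel))
```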

### 2. Classic architecture
The crown of straightforward architectures for convolutional networks is probably [VGG](https://arxiv.org/pdf/1409.1556.pdf). In fact it is a chain of a fixed set of layers. The most common form of a conv net architecture stacks a few conv + ReLU layers, follows them with pool layers, and repeats this pattern until the image has been reduced spatially to a small size. At some point, it is common to transition to fully connected layers. The last fully connected layer holds the output, such as the class scores. The individual layers are described in more detail below.

#### Convolution layer
The conv layers should use small filters (e.g. $3 \times 3$ or at most $5 \times 5$) with a stride of $1$ and, crucially, pad the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input. For example, with a filter size of $3$, a padding of $1$ preserves the input size.

#### Pooling layer
Another important concept is pooling, which is a form of non-linear down-sampling. There are several non-linear functions to implement pooling among which max pooling is the most common. It partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum. The intuition is that the exact location of a feature is less important than its rough location relative to other features. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters and amount of computation in the network, and hence to also control overfitting. It is common to periodically insert a pooling layer between successive convolutional layers in a CNN architecture. The pooling operation provides another form of translation invariance.

The pooling layer operates independently on every depth slice of the input and resizes it spatially. The most common form is a pooling layer with filters of size $2 \times 2$ applied with a stride of $2$, which downsamples every depth slice in the input by $2$ along both width and height.
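
The $2 \times 2$, stride $2$ case can be written compactly in NumPy by reshaping each depth slice into non-overlapping blocks. This is a sketch that assumes an `(H, W, C)` layout with even `H` and `W`; the function name is illustrative.

```python
import numpy as np


def max_pool_2x2(x):
    """Max pooling with 2 x 2 windows and stride 2 on x of shape (H, W, C)."""
    h, w, c = x.shape
    # Split both spatial axes into (block index, position inside block).
    blocks = x.reshape(h // 2, 2, w // 2, 2, c)
    # Take the maximum inside each 2 x 2 block, separately for every channel.
    return blocks.max(axis=(1, 3))


x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
print(pooled[:, :, 0])  # rows are [5, 7] and [13, 15]
```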

In addition to max pooling, the pooling units can use other functions, such as average pooling. Average pooling was often used historically but has recently fallen out of favor compared to max pooling, which works better in practice. Due to the aggressive reduction in the size of the representation, the trend is towards using smaller filters or discarding the pooling layer altogether.

#### Activation function (nonlinearity)
ReLU is the abbreviation of Rectified Linear Units. This layer applies the non-saturating activation function 
$\sigma(x) = \max(0,x).$ It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer. Other functions are also used to increase nonlinearity, for example the saturating hyperbolic tangent and the sigmoid function $f(x)=(1+e^{-x})^{-1}$. ReLU is often preferred to other functions, because it trains the neural network several times faster without a significant penalty to generalisation accuracy. Also there are several variations such as LeakyReLU or ELU.

#### Fully connected layer (dense layer)
Finally, after several convolutional and max pooling layers, the high-level reasoning in the neural network is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular neural networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.

It is worth noting that the only difference between FC (fully connected layer) and conv layer is that the neurons in the conv layer are connected only to a local region in the input, and that many of the neurons in a conv volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it is possible to convert from FC to conv layers.

For example, an FC layer with output size $K$ that is looking at some input volume of size $S \times S \times F$ can be equivalently expressed as a conv layer with filter size $S$, padding $0$, stride $1$ and $K$ filters.
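
This equivalence is easy to check numerically. In the sketch below (NumPy, with small illustrative values for $S$, $F$ and $K$) the FC weight matrix is reshaped into $K$ filters of size $S \times S \times F$; since each filter covers the whole input, the "convolution" has exactly one spatial position.

```python
import numpy as np

rng = np.random.default_rng(0)
S, F, K = 4, 3, 5
x = rng.standard_normal((S, S, F))        # input volume
W = rng.standard_normal((K, S * S * F))   # FC weights
b = rng.standard_normal(K)                # FC biases

# Fully connected layer on the flattened input.
fc_out = W @ x.ravel() + b                # shape (K,)

# The same weights viewed as K filters of size S x S x F.
filters = W.reshape(K, S, S, F)
conv_out = np.array([np.sum(x * filters[k]) for k in range(K)]) + b
conv_out = conv_out.reshape(1, 1, K)      # output volume 1 x 1 x K

print(np.allclose(fc_out, conv_out[0, 0]))
```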

#### Loss function
The loss layer specifies how training penalizes the deviation between the predicted and true labels and is normally the final layer. Various loss functions appropriate for different tasks may be used there. [Softmax](https://en.wikipedia.org/wiki/Softmax_function) is used for predicting a single class of $K$ mutually exclusive classes. [Sigmoid cross-entropy](https://www.tensorflow.org/api_docs/python/tf/losses/sigmoid_cross_entropy) is used for predicting $K$ independent probability values. [Euclidean loss](https://en.wikipedia.org/wiki/Root-mean-square_deviation) is used for regressing to real-valued labels.
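
As an illustration of the first case, here is a minimal NumPy sketch of the softmax cross-entropy loss for $K$ mutually exclusive classes. The function name and the `(N, K)` score layout are assumptions of the example; subtracting the row maximum is a standard trick for numerical stability and does not change the result.

```python
import numpy as np


def softmax_cross_entropy(scores, labels):
    """Mean cross-entropy for scores of shape (N, K) and integer labels of shape (N,)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Uniform scores over 4 classes give a loss of log(4), about 1.386.
uniform_loss = softmax_cross_entropy(np.zeros((2, 4)), np.array([0, 3]))
print(uniform_loss)

# A confident and correct prediction gives a loss close to 0.
confident_loss = softmax_cross_entropy(np.array([[10.0, 0.0, 0.0]]), np.array([0]))
print(confident_loss)
```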


#### Exercises
1. Download dataset from [kaggle](https://www.kaggle.com/c/ch-2017).
2. Suggest your own net architecture (start with something really simple).
3. What quality do you achieve?
4. Can you transform VGG model for your problem?
5. What is your score?
6. Imagine that your conv net makes a forward pass. How can you estimate its memory consumption?
7. Now you make a backpropagation step. Why does it require much more memory?

In [1]:
import numpy as np
import matplotlib.pyplot as plt

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torch
import torchvision
import torchvision.transforms as transforms

import PIL
from PIL import Image

from torch.utils.data.dataset import Dataset
from torchvision import transforms

import pandas as pd 

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.benchmark = True

print(device)

cpu


### First, let's move indexes into a really adorable format (done in the next notebook; I don't think it's important to do it here as well)

In [2]:
correctIndexes = [59076, 59332, 59844, 60100, 59588, 60356, 60612, 61636, 61124, 62404, 63172, 61380, 59845, 60357, 60101, 59333, 59589, 60869, 64964, 62917, 61892, 63685, 63429, 63940, 64453, 64708, 59078, 63941, 64452, 62148, 61381, 62660, 63684, 62149, 61893, 63942, 61382, 65220, 62661, 63174, 59335, 64966, 59590, 59077, 61383, 61894, 62407, 60615, 60103, 60358, 59846, 63175, 60870, 63687, 63430, 64198, 62662, 61637, 64455, 64965, 62919, 62916, 62151, 62918, 60613, 64199, 64200, 63177, 64969, 61897, 65225, 59592, 59850, 60871, 62920, 62153, 59334, 60105, 62409, 60618, 63176, 62664, 59079, 63432, 61640, 65223, 60873, 59339, 64458, 62406, 64202, 63433, 60868, 62411, 62922, 59336, 61643, 62923, 59338, 63686, 59849, 62921, 64967, 59080, 60360, 61132, 65224, 64970, 65221, 63180, 62412, 64457, 62154, 64968, 60877, 63434, 62663, 65222, 61896, 60619, 64971, 62150, 63178, 61639, 63179, 62152, 61133, 62405, 62669, 59337, 59848, 60622, 59598, 60362, 63173, 64204, 60874, 59083, 61642, 62924, 59087, 64461, 63943, 64715, 62415, 64459, 64717, 61644, 61391, 62413, 59342, 63948, 58828, 60876, 59852, 61645, 62410, 58829, 63689, 62925, 60879, 64463, 58831, 61647, 63691, 60108, 62926, 64975, 61131, 64973, 59593, 64201, 62156, 61385, 61904, 63952, 65229, 59596, 64460, 59081, 64196, 64712, 63440, 59088, 62159, 59599, 61902, 60364, 61650, 59859, 58832, 59855, 64711, 58835, 64206, 64208, 59591, 64709, 64713, 63436, 61139, 60361, 59600, 62666, 60369, 59851, 62157, 62162, 61128, 63181, 61900, 62163, 63699, 63693, 60614, 58827, 62673, 60627, 62421, 61397, 59857, 60373, 60620, 63187, 62670, 61396, 65227, 60882, 64980, 58838, 59603, 62665, 61135, 59856, 64978, 59860, 63188, 62678, 60872, 64725, 61129, 61905, 64726, 64211, 63954, 63435, 59349, 60115, 60370, 64981, 63951, 64468, 60110, 59351, 61392, 62420, 65231, 61906, 62674, 60626, 60883, 64213, 63957, 59601, 63183, 62408, 61138, 64209, 60119, 59086, 60368, 62929, 60624, 64716, 62164, 59058, 63688, 60113, 62677, 61874, 64724, 60621, 65234, 
64721, 64946, 63666, 61134, 60109, 59602, 59571, 62165, 61125, 60338, 60885, 63946, 60339, 63153, 61389, 60375, 59345, 64977, 65230, 61398, 62167, 59090, 63955, 65236, 62668, 60112, 61648, 60372, 63182, 62672, 63690, 59060, 64179, 59604, 60083, 63945, 64944, 59570, 64176, 64469, 63408, 63696, 59853, 59594, 61618, 59085, 63437, 64466, 60107, 59347, 64467, 59095, 62129, 60595, 64719, 64689, 62641, 63667, 63441, 64714, 62646, 62900, 59094, 59858, 62676, 63958, 58834, 59344, 61875, 63191, 61899, 63412, 64456, 60084, 59341, 64950, 64722, 64435, 60853, 61620, 61143, 59606, 64454, 63190, 59093, 59605, 59824, 62903, 64948, 59318, 64436, 61387, 61616, 59830, 61366, 62133, 64691, 60087, 61393, 65206, 60881, 63186, 59312, 59572, 63444, 63158, 64183, 64197, 62648, 60346, 65203, 61360, 60343, 65238, 63926, 60599, 60344, 64696, 62650, 61641, 64465, 62392, 62390, 60886, 60341, 61384, 63671, 61652, 63443, 64438, 64214, 59833, 63697, 60091, 59323, 58839, 59084, 61624, 62651, 60629, 59313, 64188, 60363, 62643, 63668, 63447, 59829, 60857, 63160, 62396, 63703, 59835, 61649, 62902, 59317, 63949, 61881, 60336, 60082, 64210, 63692, 62142, 60603, 59089, 64723, 62910, 59057, 64957, 61373, 61141, 63411, 65210, 59854, 62418, 61629, 61137, 63933, 64178, 59863, 63152, 60080, 61375, 64982, 60371, 59324, 59825, 62667, 58833, 59837, 62132, 61106, 59847, 61371, 61651, 64442, 60605, 59569, 62139, 65205, 62642, 63418, 60085, 60352, 65226, 59597, 61878, 61370, 62417, 61625, 63925, 60351, 61142, 62652, 60342, 61105, 63162, 61127, 64433, 62155, 59828, 63438, 65216, 63414, 64192, 63679, 63154, 61367, 62896, 62934, 63419, 61627, 62898, 62416, 61888, 61898, 64203, 61117, 63953, 64718, 59082, 62398, 60878, 59063, 63426, 62138, 65235, 63442, 61364, 61911, 64972, 61632, 60114, 61374, 62394, 62907, 64706, 62908, 61879, 59832, 63664, 62904, 60866, 63424, 62140, 60875, 63939, 61638, 64464, 62644, 61379, 61646, 64694, 65204, 64695, 59841, 62909, 64947, 64700, 60366, 63189, 64185, 64443, 64193, 62387, 64432, 
59319, 60089, 60862, 63422, 60617, 60594, 64437, 61121, 62391, 63930, 62679, 62653, 60855, 64445, 62912, 59070, 65202, 63924, 61635, 64693, 63934, 62656, 62906, 64701, 64979, 60353, 62914, 61630, 59065, 63416, 62385, 61388, 63446, 63695, 60350, 63165, 59343, 63944, 63155, 59838, 63959, 62399, 60097, 61655, 60337, 61909, 60602, 61885, 59072, 64703, 64441, 59092, 61883, 65200, 61626, 63935, 64953, 61126, 63166, 62649, 63677, 60092, 62397, 64976, 62144, 62160, 59577, 59585, 60861, 64958, 61873, 63164, 62403, 64212, 62136, 62901, 61886, 63956, 62141, 63702, 63674, 60081, 62927, 60349, 64446, 63168, 63931, 63676, 61122, 62158, 63673, 64699, 61631, 63159, 64470, 60623, 62655, 61116, 62395, 64698, 64186, 60865, 64444, 63415, 65215, 61634, 63665, 64960, 63672, 63927, 60852, 63413, 59582, 61622, 61882, 64702, 64952, 62137, 60354, 63423, 64707, 60098, 61369, 63681, 63420, 60609, 60607, 64434, 59584, 60616, 63156, 62928, 59575, 60849, 62393, 62422, 61889, 59576, 65212, 64187, 59580, 63425, 60117, 60597, 58836, 65209, 64963, 60093, 61890, 60367, 63928, 61377, 64191, 60102, 63167, 63938, 59326, 64181, 63169, 64194, 64180, 64949, 60593, 59842, 59840, 64450, 60628, 60608, 63675, 60604, 61876, 61361, 61395, 60095, 64440, 59067, 64961, 61619, 61872, 59579, 60606, 61877, 64182, 60348, 60094, 64189, 63678, 60365, 59331, 63171, 61376, 60610, 62135, 63920, 60887, 63698, 59861, 62899, 61884, 61118, 62388, 65232, 61907, 60860, 61887, 61111, 60858, 64697, 65233, 61110, 61901, 62930, 64451, 60096, 64462, 59316, 61109, 63439, 62654, 60611, 64205, 62401, 60864, 60630, 62647, 61628, 64962, 63936, 63683, 60099, 65208, 59073, 63682, 62658, 59348, 61617, 59843, 59074, 62145, 62128, 60106, 59321, 63157, 60104, 59587, 63161, 61623, 64974, 64447, 61112, 64955, 63410, 60884, 63932, 59862, 64207, 61378, 60880, 62161, 59061, 62659, 62933, 59568, 61107, 63680, 59322, 60340, 63937, 62931, 62419, 64959, 64954, 62400, 59315, 59330, 62402, 64177, 62645, 59314, 59834, 60592, 64690, 60856, 62143, 59826, 
63170, 62657, 61633, 63922, 62386, 59328, 62675, 63427, 59071, 63947, 61113, 61368, 62913, 61123, 61108, 65237, 59327, 63421, 65211, 63431, 64705, 60854, 63184, 64190, 59075, 65213, 63923, 60359, 61654, 59350, 59839, 62130, 63669, 62389, 63445, 59091, 60118, 60355, 59320, 61908, 64184, 59062, 61394, 61386, 64439, 62911, 63163, 62423, 60347, 59595, 62147, 59346, 61104, 60086, 64951, 60374, 60601, 64720, 63185, 60345, 60867, 63700, 61363, 60863, 61130, 61140, 62897, 58830, 61399, 61119, 64449, 62915, 59831, 60859, 61114, 61880, 63409, 60111, 59064, 59329, 61910, 59836, 59325, 65207, 62932, 63670, 60625, 60090, 65219, 60631, 60596, 61120, 59056, 62640, 65217, 61136, 60088, 59068, 59059, 59586, 59574, 59607, 63921, 59069, 63929, 61365, 64195, 59578, 59340, 62671, 60850, 60600, 65201, 62414, 64945, 62384, 61362, 64710, 59583, 64688, 64448, 61895, 59581, 65218, 62131, 62146, 58837, 61891, 63701, 61115, 61390, 59573, 61903, 61621, 59066, 64692, 60851, 60848, 60598, 60116, 62166, 65214, 61653, 62134, 64956, 63417, 59827]

### Dataset class

In [3]:
class DatasetFromImagesTrain(Dataset):
    def __init__(self, img_paths, size):
        # Read data
        self.data_info = np.concatenate([np.load(file) for file in img_paths])

        # First column contains images
        self.image_arr = np.asarray(self.data_info[:, 0])
        
        # Second column is the labels
        self.label_arr = np.asarray(self.data_info[:, 1])
        
        # Calculate len
        self.data_len = (self.data_info).shape[0]
        
        #Shapes
        self.size = size

    def __getitem__(self, index):
        # Get ndarray from index
        img_as_ndarray = self.image_arr[index]
        
        # Open image
        img_as_img = Image.fromarray(img_as_ndarray)
        
        # Reshaping
        img_as_img = img_as_img.resize((self.size, self.size), PIL.Image.HAMMING)
        
        # Transform image to tensor
        img_as_tensor = torch.from_numpy(np.asarray(img_as_img, dtype=np.float32))
        
        # Preproc
        img_as_tensor = img_as_tensor.unsqueeze(0)
        
        # Get label(class) of the image based on the cropped pandas column
        single_image_label = correctIndexes.index(self.label_arr[index])

        return (img_as_tensor, single_image_label)

    def __len__(self):
        return self.data_len


### Net

In [8]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.conv3 = nn.Conv2d(16, 32, 3)
        self.conv4 = nn.Conv2d(32, 32, 3)
        
        self.fc1 = nn.Linear(128, 84)
        self.fc2 = nn.Linear(84, 1000)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = self.pool(F.relu(self.conv4(x)))
        x = x.view(-1, 128)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
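
The value `128` in `fc1` and in `x.view(-1, 128)` follows from the input size. A quick sketch of the arithmetic, assuming the 68 x 68 input used in the data loader below, unpadded stride-1 convolutions and 2 x 2 pooling as in `Net` (the helper names are illustrative):

```python
def after_conv(size, kernel):
    """Unpadded convolution with stride 1."""
    return size - kernel + 1


def after_pool(size):
    """2 x 2 max pooling with stride 2 (rounds down)."""
    return size // 2


size = 68
sizes = []
for kernel in (5, 5, 3, 3):          # conv1 .. conv4
    size = after_pool(after_conv(size, kernel))
    sizes.append(size)

print(sizes)                         # [32, 14, 6, 2]
flat_features = 32 * size * size     # 32 channels after conv4
print(flat_features)                 # 128
```

If the input size changes, the first argument of `fc1` has to change accordingly.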

### Setting params

In [9]:
net = Net()
net = net.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.0005)
#optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

## Training process

### DataLoader

In [6]:
data = DatasetFromImagesTrain(['Ieroglifs/train-1.npy', 'Ieroglifs/train-4.npy', 'Ieroglifs/train-3.npy', 'Ieroglifs/train-2.npy'], 68)
trainLoader = torch.utils.data.DataLoader(dataset=data, batch_size=128, shuffle=False)

In [20]:
for epoch in range(24):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainLoader, 0):
        # get the inputs
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        # print statistics
        running_loss += loss.item()
        if i % 20 == 19:
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 20))
            running_loss = 0.0
print('Finished Training')

[1,    20] loss: 0.049
[1,    40] loss: 0.045
[1,    60] loss: 0.049
[1,    80] loss: 0.047
[1,   100] loss: 0.047
[1,   120] loss: 0.056
[1,   140] loss: 0.045
[1,   160] loss: 0.055
[1,   180] loss: 0.053
[1,   200] loss: 0.051
[1,   220] loss: 0.068
[1,   240] loss: 0.063
[1,   260] loss: 0.080
[1,   280] loss: 0.078
[1,   300] loss: 0.093
[1,   320] loss: 0.112
[1,   340] loss: 0.101
[1,   360] loss: 0.098
[1,   380] loss: 0.108
[1,   400] loss: 0.157
[1,   420] loss: 0.147
[1,   440] loss: 0.197
[1,   460] loss: 0.143
[1,   480] loss: 0.173
[1,   500] loss: 0.190
[1,   520] loss: 0.220
[1,   540] loss: 0.219
[1,   560] loss: 0.246
[1,   580] loss: 0.218
[1,   600] loss: 0.218
[1,   620] loss: 0.224
[1,   640] loss: 0.214
[1,   660] loss: 0.231
[1,   680] loss: 0.215
[1,   700] loss: 0.225
[1,   720] loss: 0.286
[1,   740] loss: 0.283
[1,   760] loss: 0.260
[1,   780] loss: 0.304
[1,   800] loss: 0.248
[1,   820] loss: 0.232
[1,   840] loss: 0.246
[1,   860] loss: 0.270
[1,   880] 

KeyboardInterrupt: 

In [13]:
class CustomDatasetFromImages(Dataset):
    def __init__(self, img_paths, size):
        # Read data
        self.data_info = np.concatenate([np.load(file) for file in img_paths])

        # First column contains images
        self.image_arr = np.asarray(self.data_info[:])
        
        # Calculate len
        self.data_len = (self.data_info).shape[0]
        
        #Shapes
        self.size = size

    def __getitem__(self, index):
        # Get ndarray from index
        img_as_ndarray = self.image_arr[index]
        
        # Open image
        img_as_img = Image.fromarray(img_as_ndarray)
        
        # Reshaping
        img_as_img = img_as_img.resize((self.size, self.size), PIL.Image.HAMMING)
        
        # Transform image to tensor
        img_as_tensor = torch.from_numpy(np.asarray(img_as_img, dtype=np.float32))
        
        # Preproc
        img_as_tensor = img_as_tensor.unsqueeze(0)

        return img_as_tensor

    def __len__(self):
        return self.data_len


In [14]:
dataTest = CustomDatasetFromImages(['Ieroglifs/test.npy'], 68)
testLoader = torch.utils.data.DataLoader(dataset=dataTest, batch_size=1, shuffle=False)

In [None]:
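net.eval()  # switch to inference mode before predicting
totalTest = 0
res = []
with torch.no_grad():
    for images in testLoader:
        # The model lives on `device`, so the inputs must be moved there too
        outputs = net(images.to(device))
        _, predicted = torch.max(outputs, 1)
        totalTest += 1
        # Store a plain Python int so it can index correctIndexes later
        res.append(predicted[0].item())
        # Progress report every 10000 images
        if totalTest % 10000 == 9999:
            print(totalTest + 1)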

10000
20000
30000
40000


In [16]:
ans = [correctIndexes[res[i]] for i in range(len(res))]


In [17]:
ansWide = []
for i in range(len(ans)):
    ansWide.append([i + 1, ans[i]])

In [18]:
res = np.asarray(ansWide)

In [19]:
df = pd.DataFrame(res)
df.to_csv("resCSV.csv", header=["Id", "Category"], index=False)

In [11]:
torch.save(net.state_dict(), "850000% nn.txt")

In [26]:
net = Net()
net.load_state_dict(torch.load("85% nn.txt"))
net.eval()

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=1152, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=1000, bias=True)
)

### 3. Regularization
Regularization is a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. It is possible to use various types of regularization for conv nets. You are already familiar with the classic $L1$ and $L2$ regularization; below we consider some more specific techniques.

#### Early stopping
One more method to prevent overfitting is simply to stop training before overfitting has had a chance to occur, typically when the error on a held-out validation set stops improving. The disadvantage is that the learning process is halted. A common complementary practice is to slowly decrease the learning rate.
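
A minimal sketch of the stopping rule, assuming a patience-based criterion (stop after `patience` epochs without improvement on a validation set). The function name, the patience value and the loss values are made up for illustration; in a real training loop the losses arrive one epoch at a time and the best weights are saved as you go.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights should be kept."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember this epoch.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_epoch


losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.50]
print(early_stopping_epoch(losses, patience=3))  # 2
```

The example also shows the disadvantage mentioned above: training stops at epoch 5, so the later improvement to 0.50 is never reached.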

#### Dropout
Because a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is [dropout](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf). At each training stage, individual nodes are either "dropped out" of the net with probability $1-p$ or kept with probability $p$, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights. In the training stages, the probability that a hidden node will be dropped is usually $0.5$; for input nodes it should be much lower, intuitively because information is directly lost when input nodes are ignored.

At testing time, after training has finished, we would ideally like to find a sample average of all possible $2^{n}$ dropped-out networks; unfortunately, this is infeasible for large values of $n$. However, we can find an approximation by using the full network with each node's output weighted by a factor of $p$, so the expected value of the output of any node is the same as in the training stages. This is the biggest contribution of the dropout method: although it effectively generates $2^{n}$ neural nets, and as such allows for model combination, at test time only a single network needs to be tested.
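
The train/test asymmetry described above can be sketched in NumPy as follows. Here `p` is the probability of keeping a node, as in the text; the function names are illustrative.

```python
import numpy as np


def dropout_train(activations, p, rng):
    """Training: keep each node with probability p, zero it otherwise."""
    mask = rng.random(activations.shape) < p
    return activations * mask


def dropout_test(activations, p):
    """Testing: keep every node but weight its output by p."""
    return activations * p


rng = np.random.default_rng(0)
a = np.ones(100_000)
p = 0.5

train_mean = dropout_train(a, p, rng).mean()
test_mean = dropout_test(a, p).mean()
print(round(float(train_mean), 2), test_mean)  # both close to 0.5
```

Many frameworks use the equivalent "inverted" variant instead, which rescales the kept activations during training so that nothing needs to change at test time. Note also that PyTorch's `nn.Dropout(p)` takes the probability of dropping a node, not of keeping it.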

By avoiding training all nodes on all training data, dropout decreases overfitting. The method also significantly improves training speed. This makes model combination practical, even for deep neural nets. The technique seems to reduce node interactions, leading them to learn more robust features that better generalize to new data.

#### Stochastic pooling
A major drawback to dropout is that it does not have the same benefits for convolutional layers, where the neurons are not fully connected.

In [stochastic pooling](https://arxiv.org/abs/1301.3557), the conventional deterministic pooling operations are replaced with a stochastic procedure, where the activation within each pooling region is picked randomly according to a multinomial distribution, given by the activities within the pooling region. The approach is hyperparameter free and can be combined with other regularization approaches, such as dropout and data augmentation.

An alternate view of stochastic pooling is that it is equivalent to standard max pooling but with many copies of an input image, each having small local deformations. This is similar to explicit elastic deformations of the input images, which delivers excellent MNIST performance. Using stochastic pooling in a multilayer model gives an exponential number of deformations since the selections in higher layers are independent of those below.
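
A sketch of the sampling step for a single pooling region, assuming non-negative activations (for example after ReLU) so that they can be normalised into probabilities. The function name and the handling of an all-zero region are choices made for this illustration.

```python
import numpy as np


def stochastic_pool_region(region, rng):
    """Pick one activation with probability proportional to its value."""
    a = region.ravel()
    total = a.sum()
    if total == 0:
        # All activations are zero: nothing to sample from.
        return 0.0
    probs = a / total                  # multinomial probabilities
    return float(rng.choice(a, p=probs))


rng = np.random.default_rng(0)
region = np.array([[1.0, 0.0],
                   [3.0, 0.0]])
samples = np.array([stochastic_pool_region(region, rng) for _ in range(10_000)])

# The activation 3.0 should be picked about 75% of the time, 1.0 about 25%.
frac_three = float(np.mean(samples == 3.0))
print(frac_three)
```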

#### Batch normalization
[Batch normalization](https://arxiv.org/pdf/1502.03167.pdf) is a method for improving the performance and stability of neural networks, and also makes more sophisticated deep learning architectures work in practice.
The idea is to normalise the inputs of each layer in such a way that they have a mean output activation of zero and standard deviation of one. This is analogous to how the inputs to networks are standardised.

How does this help? We know that normalising the inputs to a network helps it learn. But a network is just a series of layers, where the output of one layer becomes the input to the next. That means we can think of any layer in a neural network as the first layer of a smaller subsequent network. Thought of as a series of neural networks feeding into each other, we normalise the output of one layer before applying the activation function, and then feed it into the following layer. It’s called "batch" normalization because during training, we normalise the activations of the previous layer for each batch, i.e. apply a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.
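
The per-batch transformation can be sketched in NumPy as below. The learnable scale `gamma` and shift `beta` come from the batch normalization paper; the function name and the `(N, D)` layout are assumptions of the example. Only the training-time forward pass is shown; at test time, running averages of the mean and variance are used instead of the batch statistics.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalise each feature of x (shape N, D) over the batch dimension."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta               # learnable rescaling


rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((128, 4))   # badly scaled activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0).round(3))   # close to 0 for every feature
print(out.std(axis=0).round(3))    # close to 1 for every feature
```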

Batch normalization has several benefits:

1. Networks train faster: each training iteration is slower because of the extra normalisation calculations during the forward pass and the additional parameters to train during back propagation, but the network should converge much more quickly, so training should be faster overall.
2. Allows higher learning rates: gradient descent usually requires small learning rates for the network to converge. As networks get deeper, gradients get smaller during back propagation, and so require even more iterations. Using batch normalisation allows much higher learning rates, increasing the speed at which networks train.
3. Makes weights easier to initialise: weight initialisation can be difficult, especially when creating deeper networks. Batch normalisation helps reduce the sensitivity to the initial starting weights.
4. Makes more activation functions viable: some activation functions don’t work well in certain situations. Sigmoids lose their gradient quickly, which makes them hard to use in deep networks, and ReLUs often die out during training (stop learning completely), so we must be careful about the range of values fed into them. Because batch normalisation regulates the values going into each activation function, nonlinearities that otherwise train poorly in deep networks become viable again.
5. Provides some regularisation: batch normalisation adds a little noise to your network, and in some cases (e.g. Inception modules) it has been shown to work as well as dropout. You can consider batch normalisation as a bit of extra regularization, allowing you to reduce some of the dropout you might add to a network.

#### Exercises
1. Try the regularization method you prefer (batch normalization is strongly recommended).
2. Does it help to improve the quality of classification? Which methods did you use?

### 4. Data augmentation
Data augmentation is another way to reduce overfitting: we increase the amount of training data using only the information already present in the training set. It is widely accepted that the more data an ML algorithm has access to, the more effective it can be. Even when the data is of lower quality, algorithms can actually perform better, as long as the model can extract useful information from the original data set. For example, text-to-speech and text-based models have improved significantly due to the release of a trillion-word corpus by Google. This result holds despite the fact that the data was collected from unfiltered web pages and contains many errors. With such large and unstructured data sets, however, the task becomes one of finding structure within a sea of unstructured data.

However, alternative approaches exist. Rather than starting with an extremely large corpus of unstructured and unlabeled data, can we instead take a small, curated corpus of structured data and augment it in a way that increases the performance of models trained on it? This approach has proven effective on multiple problems.

A generic and widely accepted practice for augmenting image data is to perform geometric and color augmentations, such as reflecting the image, cropping and translating the image, and changing the color palette of the image. For example, handwritten digit data has been augmented with elastic deformations in addition to the typical affine transformations.
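
A minimal NumPy sketch of the geometric and color augmentations mentioned above. It assumes images are float arrays of shape `(height, width, channels)` with values in `[0, 1]`; the helper names and the `pad` and `max_delta` parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def random_flip(image, rng):
    """Mirror the image left-to-right with probability 0.5."""
    if rng.random() < 0.5:
        return image[:, ::-1, :]
    return image

def random_translate_crop(image, rng, pad=4):
    """Zero-pad the image, then crop a random window of the original size."""
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)    # vertical shift in [0, 2 * pad]
    left = rng.integers(0, 2 * pad + 1)   # horizontal shift in [0, 2 * pad]
    return padded[top:top + h, left:left + w, :]

def random_brightness(image, rng, max_delta=0.2):
    """Add a random constant to all pixels and clip back to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(image + delta, 0.0, 1.0)

def augment(image, rng):
    image = random_flip(image, rng)
    image = random_translate_crop(image, rng)
    return random_brightness(image, rng)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # stand-in for a real training image
augmented = augment(image, rng)
print(augmented.shape)            # same shape as the input image
```

Augmentations should preserve the label: a horizontal flip is usually harmless for natural images, but it can change the meaning of digits or text, so `random_flip` may need to be dropped for such data.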

#### Exercises
1. Try some simple augmentation techniques, e.g. rotation, scaling, etc.
2. Does it help to improve the quality of classification?
3. You can read [this paper](http://cs231n.stanford.edu/reports/2017/pdfs/300.pdf) for more information.

### 5. Modern architecture
In practice, it is better to use whatever works best on [ImageNet](http://www.image-net.org). If you’re feeling a bit of a fatigue in thinking about the architectural decisions, you will be pleased to know that in 90% or more of applications you should not have to worry about these. Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data. You should rarely ever have to train a ConvNet from scratch or design one from scratch.

It should be noted that the conventional paradigm of a linear list of layers has recently been challenged, in Google’s inception architectures and also in current (state of the art) residual networks from Microsoft Research Asia. Both of these (see details below) feature more intricate and different connectivity structures.

**GoogLeNet** The ILSVRC 2014 winner was a Convolutional Network from Google. Its main contribution was the development of an [inception module](https://arxiv.org/pdf/1409.4842) that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses average pooling instead of fully connected layers at the top of the net, eliminating a large number of parameters that do not seem to matter much. There are also several follow-up versions of GoogLeNet, most recently [inception-v4](https://arxiv.org/pdf/1602.07261).
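
To see why average pooling at the top of the net saves so many parameters, here is a small arithmetic sketch. The sizes (a $7 \times 7 \times 1024$ final feature map and 1000 classes) are assumed for illustration, and `global_average_pool` is a hypothetical helper name.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over height and width: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

height, width, channels, num_classes = 7, 7, 1024, 1000   # assumed sizes
features = np.random.default_rng(0).random((height, width, channels))
pooled = global_average_pool(features)

# Weights of a classifier applied to the flattened feature map
flatten_weights = height * width * channels * num_classes
# Weights of a classifier applied after global average pooling
pooled_weights = channels * num_classes

print(pooled.shape)       # (1024,)
print(flatten_weights)    # 50176000
print(pooled_weights)     # 1024000
```

Under these assumptions the classifier on top of the pooled features needs 49 times fewer weights than one on top of the flattened feature map.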

**ResNet** [Residual networks](https://arxiv.org/pdf/1512.03385), developed by Kaiming He et al., won ILSVRC 2015. They feature special skip connections and a heavy use of batch normalization. The architecture also has no fully connected layers at the end of the network. ResNets are currently by far the state-of-the-art conv net models and are the default choice for use in practice. Also see more recent developments that tweak the original architecture, e.g. in [this paper](https://arxiv.org/pdf/1603.05027).
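
The idea of a skip connection can be sketched as follows. This is a simplified illustration that assumes dense layers instead of convolutions and leaves out batch normalization; `residual_block`, `w1` and `w2` are hypothetical names, not taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(x + F(x)), where F is two linear layers with a ReLU between."""
    residual = relu(x @ w1) @ w2   # F(x): the part the block has to learn
    return relu(x + residual)      # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                # batch of 8 feature vectors
w1 = rng.normal(scale=0.1, size=(16, 16))
w2 = np.zeros((16, 16))                     # F(x) = 0 with these weights

out = residual_block(x, w1, w2)
print(np.allclose(out, relu(x)))            # the block only passes relu(x) through
```

With the residual branch at zero, the block simply forwards its input (up to the final ReLU), which is one intuition for why stacking many such blocks does not make optimisation harder.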

#### Exercises
1. Try to adopt a modern architecture for your task.
2. Please, explain your decision. What problems have you encountered?