**DEEP LEARNING IN FIVE DAYS**

*Patrick Donnelly*

Day one: Defining a use case for deep learning

Chapter one: Introduction and image classification

In *deep learning* (DL), a team of researchers or engineers uses *artificial neural networks* (ANNs) to learn models from data (*training*). These models are then used to generate predictions from data (*inference*). In other words:

`training: data + network = model` and

`inference: model + data = predictions`

Strictly speaking, a *prediction* refers to a forecast, or "statement about a future event" (https://en.wikipedia.org/wiki/Prediction). We'll use the term more loosely to refer to the output of passing data through a trained model.

In this workshop, we'll examine the primary use cases for deep learning. We'll do this kinda backwards, identifying use cases for algorithms rather than vice versa. Ultimately, it doesn't matter whether we start with use cases or with algorithms. We just need to come up with a mapping of one to the other.

Throughout this tutorial, we will be using *PyTorch*, a *framework* for training *deep neural networks*. We'll get into the details of the mathematical operations involved in training neural networks later. For now, just know that a *framework* defines these operations at a high level of abstraction. It'll make our code less verbose, and it'll enable engineers to build neural networks without explicitly defining and optimizing all the operations, for better or worse.

PyTorch is written in Python with a C++ backend and optimized acceleration for NVIDIA GPUs with CUDA and cuDNN. CUDA is NVIDIA's software platform for general-purpose parallel computing on NVIDIA GPUs. cuDNN is NVIDIA's library of GPU-optimized routines for the operations used in training neural networks. You can run PyTorch on CPU-only systems, but you'll quickly find that you need a GPU or some other form of parallel processor if you want to train neural networks with millions of parameters on gigabytes of data in a reasonable amount of time.
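As a minimal sketch of how the CPU/GPU choice shows up in code, the snippet below checks whether PyTorch can see a CUDA device and falls back to the CPU otherwise. The names `device` and `t` are illustrative and not part of the workshop code:

```python
import torch

# Pick a GPU if PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and, later, networks) are moved to the chosen device with .to().
t = torch.zeros(2, 3).to(device)
print(device, t.device)
```

The rest of this chapter runs fine on the CPU, so this step is optional for now.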

We import PyTorch as `torch`. There's another framework out there called Torch, on which PyTorch is loosely based. It's written in Lua and ceased active development in 2018 (see https://en.wikipedia.org/wiki/Torch_(machine_learning)). PyTorch is also influenced by NumPy. You'll see that a lot of `torch` is analogous or even identical to `numpy`, and a lot of the `torch.nn` module for constructing neural networks is analogous or identical to the `nn` module in Lua Torch.

In [1]:
import torch

Now let's begin mapping out the use cases for deep learning. We'll start with perhaps the best-known set of DL applications: the domain of computer vision (CV). Our input for each of these examples is an array of pixels, which we typically represent as a *tensor* data type in PyTorch. For instance, the commonly-used MNIST dataset consists of 60,000 training images and 10,000 test images. Remember, it's necessary to partition our dataset into training and test data. We can't evaluate how our model is doing with the data used for training the model. Machine learning 101!

Each image in the MNIST dataset is 28 pixels high and 28 pixels wide. These are grayscale images, and thus only have one color channel, in "contrast" to e.g. RGB images that have three color channels. Our tensor takes four dimensions: batch size, channels, height, and width. We now know that `channels` is 1, `height` is 28, and `width` is 28.

`batch_size` tells us how many images to feed to the network at once. Let's go with a single image for now, but if we want to exploit the parallelism of a GPU, not to mention multiple GPUs and multiple nodes of multiple GPUs, we'll want to increase our batch size. Typically this is done in powers of two (since we're software engineers) until saturating GPU memory (from there you can fine-tune: step down by n/4, back up by n/8, etc.).
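To make the four dimensions concrete, here is a small sketch that builds a whole batch of random MNIST-sized images at once. The value `64` is an arbitrary illustration, not a recommendation:

```python
import torch

batch_size = 64  # hypothetical value; tune to your GPU memory

# A batch of random 8-bit grayscale "images":
# (batch size, channels, height, width)
batch = torch.randint(0, 256, (batch_size, 1, 28, 28))
print(batch.shape)  # torch.Size([64, 1, 28, 28])
```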

Let's create a random 28x28 image. We'll let each pixel take values between 0 and 255 inclusive (8 bits). To generate this image, we'll pass the parameters `0` (lower bound, inclusive), `256` (upper bound, exclusive), and `(28,28)` (height and width of the array) to `torch.randint`. Throughout, we'll use `x` to denote an input datum:

In [2]:
x = torch.randint(0, 256, (28,28))
x

tensor([[135, 208,  73,  90,  53, 101,  32,  40,  92, 108, 183, 255, 110, 103,
           2,  30, 205, 111, 233, 226, 108, 189, 101, 155, 233, 105,   8,   5],
        [143, 183,  51,  99, 105, 143,  73, 231, 254, 175, 125,  84, 164, 210,
          38, 129,  56, 249, 118, 169,  42, 125, 227, 203, 164, 215, 249, 168],
        [ 83,  89,  69,  87,  19, 213, 153, 181, 200,  87, 211, 130, 111, 190,
          86,  75,  57, 161, 197,  53,  24, 103,  12, 174, 226, 144,  64, 195],
        [162, 118,  41, 212, 118,  83, 146,  91, 227, 178,  36, 231, 105, 165,
         156,  95, 185,  96, 250, 141, 177,  13, 106, 203, 150, 247, 103, 180],
        [249,   5,  76,  47,  75, 170,  54, 151,  87, 105, 155,  85, 210,  97,
         248,  82, 166, 250, 107,  44, 213, 242, 164,  75,  26, 233,  72,  45],
        [230,  91,  63, 108,  38, 221, 228,  89, 111,  25, 226,  94, 109,  73,
         179, 242, 123, 147,  90, 158, 231, 157, 201, 236, 206, 199, 207, 138],
        [ 76, 169, 150, 127, 219, 191, 119,  7

We can investigate the shape of `x` with `x.shape`: 

In [3]:
print(x.shape)
type(x)

torch.Size([28, 28])


torch.Tensor

We'll be passing this image through a network called LeNet-5 (introduced below), which expects a 32x32 image as input, while our image is of size 28x28. We therefore have to *pad* our original image with two rows or columns of zeroes on each side, for four extra per dimension. We can do this with `numpy.pad`:

In [4]:
import numpy as np
x = np.pad(x, (2,2), 'constant')
print(x.shape)
print(x)
type(x)

(32, 32)
[[  0   0   0 ...   0   0   0]
 [  0   0   0 ...   0   0   0]
 [  0   0 135 ...   5   0   0]
 ...
 [  0   0  68 ... 130   0   0]
 [  0   0   0 ...   0   0   0]
 [  0   0   0 ...   0   0   0]]


numpy.ndarray

Whoops, looks like applying `np.pad` converted our `torch.Tensor` to an `np.ndarray`. Fortunately, it's really easy to convert data stored in an `ndarray` to a `Tensor`. We should also note that (for object `x`) `type(x)` is a Python function that returns the object type (after conversion, a `torch.Tensor`), while `x.type()` is a `torch.Tensor` method that returns the more specific data type (after conversion, a `torch.FloatTensor`, or array of 32-bit floating point data):

In [5]:
x = torch.Tensor(x)
print(x.shape)
print(x)
print(x.type())
type(x)

torch.Size([32, 32])
tensor([[  0.,   0.,   0.,  ...,   0.,   0.,   0.],
        [  0.,   0.,   0.,  ...,   0.,   0.,   0.],
        [  0.,   0., 135.,  ...,   5.,   0.,   0.],
        ...,
        [  0.,   0.,  68.,  ..., 130.,   0.,   0.],
        [  0.,   0.,   0.,  ...,   0.,   0.,   0.],
        [  0.,   0.,   0.,  ...,   0.,   0.,   0.]])
torch.FloatTensor


torch.Tensor
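If you'd rather stay in PyTorch the whole way, one alternative (not the route taken in this notebook) is `torch.nn.functional.pad`, which returns a tensor. The sketch below uses a stand-in image called `img` so that it doesn't touch `x`:

```python
import torch
import torch.nn.functional as F

img = torch.randint(0, 256, (28, 28))   # stand-in for the 28x28 image above

# Pad widths are given for the last dimension first:
# (left, right, top, bottom).
padded = F.pad(img, (2, 2, 2, 2), mode="constant", value=0)

print(type(padded))   # still a torch.Tensor, no NumPy round trip
print(padded.shape)   # torch.Size([32, 32])
print(padded.dtype)   # torch.int64; cast with .float() before feeding a network
```

Note that this keeps the integer data type of `img`, so a cast to floating point is still needed before the image goes into a network.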

Now we can reshape the image with the `view` method. We pass our dimensions `(batch size, channels, height, width)` as parameters:

In [6]:
x = x.view(1, 1, 32, 32)
print(x.shape)
x

torch.Size([1, 1, 32, 32])


tensor([[[[  0.,   0.,   0.,  ...,   0.,   0.,   0.],
          [  0.,   0.,   0.,  ...,   0.,   0.,   0.],
          [  0.,   0., 135.,  ...,   5.,   0.,   0.],
          ...,
          [  0.,   0.,  68.,  ..., 130.,   0.,   0.],
          [  0.,   0.,   0.,  ...,   0.,   0.,   0.],
          [  0.,   0.,   0.,  ...,   0.,   0.,   0.]]]])
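There's more than one way to add the batch and channel dimensions. As a quick sketch (using a stand-in tensor `img` rather than our `x`), the three calls below should all produce the same shape:

```python
import torch

img = torch.zeros(32, 32)            # stand-in for the padded image

a = img.view(1, 1, 32, 32)           # explicit target shape
b = img.unsqueeze(0).unsqueeze(0)    # insert two leading axes of size 1
c = img.view(-1, 1, 32, 32)          # -1 lets PyTorch infer the batch size

print(a.shape, b.shape, c.shape)
```

The `-1` form is handy once the batch size varies, and it's the same trick the network uses further down when it flattens its feature maps.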

Let's consider first the task of *image classification*, also known as *object recognition*. The goal here is to take our input image and identify it as belonging to one of a number of classes. To take a canonical example for a canonical dataset, our classes could correspond to digits. Our output in this case would be a vector of length 10 (the number of digits). Let's further assume that our random image is the digit `4` (it doesn't look much like a 4, but never mind). We would thus represent our *target* output as a *one-hot encoding*. Think of this as an array of zeroes where the index of the *ground truth* class - the class corresponding to the label of the image - is flipped from zero to one. Let's call this vector `y`:

In [7]:
y = [0,0,0,0,1,0,0,0,0,0]
y

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

PyTorch makes it even easier to define class labels. Just define a one-dimensional tensor with value equal to the class index. We'll assign that value to `y` (replacing the list from our prior code block):

In [8]:
y = torch.Tensor([4])
print(y.type())
y

torch.FloatTensor


tensor([4.])

Although neural networks in PyTorch expect a `torch.FloatTensor` input, our label `y` needs to be a 64-bit "long" integer (`torch.LongTensor`). Let's cast `y` to `long`:

In [9]:
y = y.long()
print(y.type())
y

torch.LongTensor


tensor([4])
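The class index and the one-hot vector carry the same information. As a small sketch of going back and forth between them, `torch.nn.functional.one_hot` builds the vector from the index and `argmax` recovers the index. The names `label` and `one_hot` are illustrative only:

```python
import torch
import torch.nn.functional as F

label = torch.tensor([4])                     # class index, dtype torch.int64
one_hot = F.one_hot(label, num_classes=10)    # shape (1, 10)
print(one_hot)    # tensor([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

# argmax recovers the class index from the one-hot row
print(one_hot.argmax(dim=1))                  # tensor([4])
```

Note that lowercase `torch.tensor([4])` infers a 64-bit integer type from the Python integer, so no separate cast is needed in this sketch.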

Excellent! Now we need to build a *neural network* to generate a model output from our input `x`. We'll call our model output `Y`. Our model output will be a *probability vector* whose (non-negative) entries sum to one (see https://en.wikipedia.org/wiki/Probability_vector). Interestingly, a one-hot encoding is a type of probability vector, despite its deterministic appearance (probability vectors are also known as "stochastic vectors").

To generate this model output, we'll apply a series of operations to our input. We'll get into the details more later. For now, let's consider the seminal network *LeNet-5* as an example of a *convolutional neural network* for image classification. Named for Yann LeCun, one of the "founding fathers" of deep learning, LeNet-5 contains *linear*, also known as *dense* or *fully-connected* operations (matrix multiplications with optional additions), as well as (two-dimensional) *convolutional* operations, which generate feature maps by sliding a *kernel* (also known as a *filter*) across the input(s) with a particular *stride*.

We also apply *rectified linear unit (ReLU)* activation functions, which zero out negative values before passing the output on to the next *layer* of operations. Finally, we apply *max pooling*, which downsamples our input values by taking the maximum value over a specified region.
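To see what ReLU and max pooling do to actual numbers, here is a toy sketch on a hand-written 4x4 tensor. The values are made up for illustration:

```python
import torch
import torch.nn.functional as F

t = torch.tensor([[-1.0,   2.0,  -3.0,   4.0],
                  [ 5.0,  -6.0,   7.0,  -8.0],
                  [-9.0,  10.0, -11.0,  12.0],
                  [13.0, -14.0,  15.0, -16.0]])

activated = F.relu(t)   # negatives become 0, positives pass through

# max_pool2d expects (batch, channels, height, width);
# a window of 2 halves the height and the width.
pooled = F.max_pool2d(activated.view(1, 1, 4, 4), 2)

print(activated)
print(pooled)   # tensor([[[[ 5.,  7.], [13., 15.]]]])
```

Each entry of `pooled` is the largest value in one non-overlapping 2x2 block of `activated`.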

Before doing all this, we need to import a couple more modules. `torch.nn` gives us the modules and classes we need to define our layers. `torch.nn.functional` helps us with our pooling and activation functions. By convention:

In [10]:
import torch.nn as nn
import torch.nn.functional as F

Now we can define our network. This is based on https://github.com/kuangliu/pytorch-cifar/blob/master/models/lenet.py and modified to take one-channel inputs:

In [11]:
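class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        # Two convolutional layers: (in channels, out channels, kernel size)
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # Three fully-connected layers: (in features, out features)
        self.fc1 = nn.Linear(16*5*5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    def forward(self, x):
        x = F.relu(self.conv1(x))    # (N, 1, 32, 32) -> (N, 6, 28, 28)
        x = F.max_pool2d(x, 2)       # -> (N, 6, 14, 14)
        x = F.relu(self.conv2(x))    # -> (N, 16, 10, 10)
        x = F.max_pool2d(x, 2)       # -> (N, 16, 5, 5)
        x = x.view(-1, 16*5*5)       # flatten to (N, 400)
        # ReLU between the linear layers; without it, the three
        # linear layers would collapse into a single linear map
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)              # raw scores, one per class
        return x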
    
net = LeNet()
print(net)

LeNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
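Where does the `16*5*5` in `fc1` come from? It's the size of the feature maps after the second pooling step. The arithmetic can be sketched in plain Python, assuming no padding in the convolutions and a pooling stride equal to the pooling window (the defaults used above). The helper `conv_out` is hypothetical and written just for this illustration:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32
size = conv_out(size, kernel=5)             # conv1:    32 -> 28
size = conv_out(size, kernel=2, stride=2)   # max pool: 28 -> 14
size = conv_out(size, kernel=5)             # conv2:    14 -> 10
size = conv_out(size, kernel=2, stride=2)   # max pool: 10 -> 5

channels = 16                               # output channels of conv2
print(size, channels * size * size)         # 5 400
```

This is also why the 28x28 image had to be padded to 32x32: with a 28x28 input the same arithmetic ends at 4x4 feature maps, and the `view` to `16*5*5` would fail.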


Now let's pass `x` through `net`. Again, we'll call our model output `Y`:

In [12]:
Y = net(x)
print(Y.shape)
Y

torch.Size([1, 10])


tensor([[ -3.6606,   5.5909, -10.9270,  -8.1490, -12.8929,  12.9511,   4.3634,
           5.5695,  -6.7247,   1.1291]], grad_fn=<AddmmBackward>)

Notice that our output is a 1x10 tensor, analogous to a vector of length 10. Note also that our tensor is not a probability vector. We can convert it to a probability vector using `torch.softmax`. However, when we compute the loss (a few cells down), we will pass our raw output `Y`, not our probability vector `torch.softmax(Y, dim = 1)`, because `nn.CrossEntropyLoss` applies the softmax (in log form) internally:

In [13]:
torch.softmax(Y, dim = 1)

tensor([[6.0954e-08, 6.3517e-04, 4.2586e-11, 6.8501e-10, 5.9633e-12, 9.9855e-01,
         1.8612e-04, 6.2173e-04, 2.8463e-09, 7.3305e-06]],
       grad_fn=<SoftmaxBackward>)

Note that our entries sum to 1. We have to do `sum` twice, since Python's built-in `sum` reduces only the first dimension of our 1x10 tensor on each call:

In [14]:
sum(sum(torch.softmax(Y, dim = 1)))

tensor(1.0000, grad_fn=<AddBackward0>)
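For intuition about what softmax computes, here is a plain-Python sketch applied to the raw scores printed in cell 12. The function `softmax` below is a hand-rolled illustration, not PyTorch's implementation:

```python
import math

# Raw scores copied from the output of cell 12 above
logits = [-3.6606, 5.5909, -10.9270, -8.1490, -12.8929,
          12.9511, 4.3634, 5.5695, -6.7247, 1.1291]

def softmax(scores):
    m = max(scores)   # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
print(sum(probs))                 # 1.0 up to floating-point rounding
print(probs.index(max(probs)))    # 5, the most probable class
```

Exponentiating makes every entry positive, and dividing by the total makes the entries sum to one. In PyTorch itself, `torch.softmax(Y, dim=1).sum()` reduces over all dimensions in one call.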

Now we can update our model parameters. First we'll import the `torch.optim` package, which will enable us to define our optimizer function:

In [15]:
import torch.optim as optim

Now let's compute the loss, *backpropagate* it to obtain gradients, and take one optimizer step to update our weights and biases:

In [16]:
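optimizer = optim.SGD(net.parameters(), lr=0.1)  # stochastic gradient descent
ce_loss = nn.CrossEntropyLoss()                  # expects raw scores and a class index
optimizer.zero_grad()                            # clear any gradients from earlier passes
loss = ce_loss(Y, y)
loss.backward()                                  # backpropagation: compute gradients
optimizer.step()                                 # update weights and biases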
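What number does the cross-entropy loss actually produce? For a single example it is the negative log of the softmax probability assigned to the true class. The sketch below checks this on the raw scores printed in cell 12, using the functional form `F.cross_entropy` and stand-in names `logits` and `target`:

```python
import torch
import torch.nn.functional as F

# Raw scores copied from the output of cell 12, and our label (class 4)
logits = torch.tensor([[-3.6606, 5.5909, -10.9270, -8.1490, -12.8929,
                        12.9511, 4.3634, 5.5695, -6.7247, 1.1291]])
target = torch.tensor([4])

loss = F.cross_entropy(logits, target)               # functional form of nn.CrossEntropyLoss
by_hand = -torch.log_softmax(logits, dim=1)[0, 4]    # -log p(class 4)

print(loss.item(), by_hand.item())
```

The loss is large here because the untrained network gave class 4 the lowest score of all ten classes.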

We've skimmed over a lot of the details in defining a basic template. Let's now consider some use cases!

While there are many applications of image classification, most of them require additional algorithms. For instance, *optical character recognition* (OCR) involves first *detecting* an object within a larger image and then classifying the image. Although an image classification network such as LeNet-5 could be used to identify handwritten digits, we would need another algorithm to identify objects in an image as "digits" before applying a classification network such as LeNet-5 to classify these images as one digit or another. (There are other ways to do object detection - this is just an example.)

Perhaps the most famous application of convolutional neural networks for image classification is the *ImageNet Large Scale Visual Recognition Challenge* (ILSVRC). ILSVRC is an annual competition to evaluate the accuracy of CNNs in classifying approximately 140 GB (for ILSVRC 2012) of color images as belonging to one of 1000 classes (to be precise, the network gets five guesses, so a prediction counts as correct if the true class is among its top five). This may seem kinda odd. How can a competition between research teams be more influential for applied deep learning than, uh, an actual application? But without the dominance of CNNs for image classification, beginning in 2012 with AlexNet cutting the top-5 error rate to roughly 15% from the roughly 26% of the best non-ANN entry, we wouldn't have seen the proliferation of more practical deep learning applications.
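The "five guesses" rule is usually called *top-5* accuracy. As a rough sketch of how such a check could look in PyTorch, here it is applied to the ten raw scores from cell 12 (a real ILSVRC model would have 1000 scores per image). The names `top5` and `hit` are illustrative:

```python
import torch

# Raw scores copied from the output of cell 12, and our label (class 4)
logits = torch.tensor([[-3.6606, 5.5909, -10.9270, -8.1490, -12.8929,
                        12.9511, 4.3634, 5.5695, -6.7247, 1.1291]])
target = torch.tensor([4])

# Indices of the five highest scores per row, best guess first
top5 = torch.topk(logits, k=5, dim=1).indices
# True where the label is among the five guesses
hit = (top5 == target.unsqueeze(1)).any(dim=1)

print(top5)   # tensor([[5, 1, 7, 6, 9]])
print(hit)    # tensor([False])
```

Averaging `hit` over a whole test set would give the top-5 accuracy, and one minus that is the top-5 error rate.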

The three instances cited below each involve more than simple classification. However, each is a common application which could not be done efficiently without a convolutional neural network for image classification:

1. Reading bank checks: LeCun et al. (1998) propose an application of LeNet-5(!!) for identifying the handwritten digits in deposited checks. This system was "deployed commercially and read several million checks per day" (http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) 

2. Image search (eBay): Take a photo with your phone and upload it to eBay's mobile app. eBay will identify the photo and search for similar items. While the *search* part of the application goes beyond simple classification, the classification is presumably done using an image classification network (https://www.ebayinc.com/stories/news/find-it-on-ebay-using-pictures-instead-of-words/)

3. Facial recognition (Facebook): Facebook applied facial recognition for "tagging" friends well before the technology extended to controversial applications such as law enforcement body cameras and airport check-ins (https://www.facebook.com/notes/facebook/making-photo-tagging-easier/467145887130/)

In the next section, we'll examine neural networks for more complex computer vision tasks. Many of these follow logically from classification. See you there!