In [1]:
import torch
torch.__version__

'1.7.1'

# A first look at a neural network

This notebook contains the code samples found in Chapter 2, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

We will now take a look at a first concrete example of a neural network, which makes use of the Python library PyTorch to learn to classify 
hand-written digits. Unless you already have experience with PyTorch or similar libraries, you will not understand everything about this 
first example right away. You probably haven't even installed PyTorch yet. Don't worry, that is perfectly fine. In the next chapter, we will 
review each element in our example and explain them in detail. So don't worry if some steps seem arbitrary or look like magic to you! 
We've got to start somewhere.

The problem we are trying to solve here is to classify grayscale images of handwritten digits (28 pixels by 28 pixels), into their 10 
categories (0 to 9). The dataset we will use is the MNIST dataset, a classic dataset in the machine learning community, which has been 
around for almost as long as the field itself and has been very intensively studied. It's a set of 60,000 training images, plus 10,000 test 
images, assembled by the National Institute of Standards and Technology (the NIST in MNIST) in the 1980s. You can think of "solving" MNIST 
as the "Hello World" of deep learning -- it's what you do to verify that your algorithms are working as expected. As you become a machine 
learning practitioner, you will see MNIST come up over and over again, in scientific papers, blog posts, and so on.

The MNIST dataset is available through the `torchvision` package, which can download and load it as follows:

In [2]:
import torch
from torchvision import datasets

dir='./dataset'
train_data = datasets.MNIST(dir, train=True, download=True)
test_data = datasets.MNIST(dir, train=False)

The `train_data` is the "training set" that the model will learn from. The model will then be tested on the 
"test set", `test_data`. Both `train_data` and `test_data` consist of a set of sample images (`data`) and their corresponding labels (`train_labels` / `test_labels`, which recent torchvision releases treat as deprecated aliases for `targets`), a tensor of digits ranging from 0 to 9. There is a one-to-one correspondence between the images and the labels.

Let's have a look at the training data:

In [3]:
train_data.data.shape

torch.Size([60000, 28, 28])

In [4]:
len(train_data.train_labels)

60000

In [5]:
train_data.train_labels

tensor([5, 0, 4,  ..., 5, 6, 8])

Let's have a look at the test data:

In [6]:
test_data.data.shape

torch.Size([10000, 28, 28])

In [7]:
len(test_data.test_labels)

10000

In [8]:
test_data.test_labels

tensor([7, 2, 1,  ..., 4, 5, 6])

Our workflow will be as follows: first we will present our neural network with the training data, `train_data`. The 
network will then learn to associate images and labels. Finally, we will ask the network to produce predictions for `test_data`, and we will check whether these predictions match the labels from `test_data`.

Let's build our network -- again, remember that you aren't supposed to understand everything about this example just yet.

In [9]:
import torch.nn as nn
import torch.nn.functional as F

# Construct model
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()

        self.fc1 = nn.Linear(28*28, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Network()
model.train()

Network(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=10, bias=True)
)


The core building block of neural networks is the "layer", a data-processing module which you can think of as a "filter" for data. Some 
data comes in, and comes out in a more useful form. Precisely, layers extract _representations_ out of the data fed into them -- hopefully 
representations that are more meaningful for the problem at hand. Most of deep learning really consists of chaining together simple layers 
which will implement a form of progressive "data distillation". A deep learning model is like a sieve for data processing, made of a 
succession of increasingly refined data filters -- the "layers".

Here our network consists of a sequence of two densely-connected (also called "fully-connected") neural layers. 
The second (and last) layer is a 10-way "softmax" layer. Because `forward` applies `F.log_softmax`, it returns an array of 10 log-probability scores (the corresponding probabilities sum to 1). Each 
score is the log of the probability that the current digit image belongs to one of our 10 digit classes.
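
To make the relationship between raw scores, log-probabilities, and probabilities concrete, here is a minimal sketch in plain Python of what a log-softmax computes. It is not PyTorch's implementation, and the `logits` values are made up for illustration.

```python
import math

def log_softmax(scores):
    """Return log-probabilities for a list of raw scores (logits)."""
    # Subtract the maximum first for numerical stability; the result is
    # unchanged because softmax is invariant to shifting all inputs.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

# Hypothetical raw scores for the 10 digit classes (illustrative only).
logits = [1.0, 2.0, 0.5, 0.0, 3.0, 0.1, -1.0, 0.3, 0.2, 1.5]

log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]

# The predicted digit is the index with the highest (log-)probability.
predicted_digit = max(range(10), key=lambda i: log_probs[i])

print("sum of probabilities:", round(sum(probs), 6))
print("predicted digit:", predicted_digit)
```

Taking the argmax of the log-probabilities gives the same prediction as taking the argmax of the probabilities, since the logarithm is monotonic.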

To make our network ready for training, we need to pick three more things:

* A loss function: this is how the network will be able to measure how good a job it is doing on its training data, and thus how it will be able to steer itself in the right direction.
* An optimizer: this is the mechanism through which the network will update itself based on the data it sees and its loss function.
* Metrics to monitor during training and testing. Here we will only care about accuracy (the fraction of the images that were correctly 
classified).
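
As a rough illustration of the first and third items, the sketch below computes a negative log-likelihood loss and an accuracy score in plain Python. The helper names `nll_loss` and `accuracy` and the tiny 3-class batch are assumptions made up for this example; they only mirror the idea behind `nn.NLLLoss` (with its default mean reduction) and an accuracy metric, not their actual implementations.

```python
import math

def nll_loss(log_probs_batch, labels):
    """Mean negative log-likelihood of the true labels."""
    picked = [-lp[y] for lp, y in zip(log_probs_batch, labels)]
    return sum(picked) / len(picked)

def accuracy(log_probs_batch, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for lp, y in zip(log_probs_batch, labels):
        predicted = max(range(len(lp)), key=lambda i: lp[i])
        correct += int(predicted == y)
    return correct / len(labels)

# Made-up log-probabilities over 3 classes for 4 samples.
batch = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
    [math.log(0.3), math.log(0.3), math.log(0.4)],
    [math.log(0.6), math.log(0.3), math.log(0.1)],
]
labels = [0, 1, 2, 1]

loss = nll_loss(batch, labels)
acc = accuracy(batch, labels)
print(f"loss={loss:.4f} accuracy={acc:.2f}")
```

The loss falls as the model assigns more probability to the correct class, while accuracy only changes when the top-scoring class changes. This is one reason to optimize the loss and merely monitor the accuracy.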

Before defining the elements above, we need to initialize the Orca context:


In [10]:
from zoo.orca import init_orca_context, stop_orca_context
from zoo.orca import OrcaContext

# recommended to set it to True when running Analytics Zoo in Jupyter notebook. 
OrcaContext.log_output = True # (this will display terminal's stdout and stderr in the Jupyter notebook).

cluster_mode = "local"

if cluster_mode == "local":
    init_orca_context(cores=1, memory="2g")   # run in local mode
elif cluster_mode == "k8s":
    init_orca_context(cluster_mode="k8s", num_nodes=2, cores=4) # run on K8s cluster
elif cluster_mode == "yarn":
    init_orca_context(
        cluster_mode="yarn-client", cores=4, num_nodes=2, memory="2g",
        driver_memory="10g", driver_cores=1,
        conf={"spark.rpc.message.maxSize": "1024",
              "spark.task.maxFailures": "1",
              "spark.driver.extraJavaOptions": "-Dbigdl.failure.retryTimes=1"})   # run on Hadoop YARN cluster

Specify loss function, optimizer, metrics, as well as the batch size for training/testing (number of samples utilized in one iteration):

In [11]:
from zoo.orca.learn.metrics import Accuracy

criterion = nn.NLLLoss()                                   # Loss function
adam = torch.optim.Adam(model.parameters(), lr=0.001)      # Optimizer
metrics = [Accuracy()]                                     # Metrics

train_batch_size = 320
test_batch_size = 320

creating: createZooKerasAccuracy


To load the data for training and evaluation, we can use PyTorch's `DataLoader`. The data should be normalized before training, and the training data is shuffled so that each mini-batch contains a mix of digits, which usually helps the model converge faster.

In [12]:
from torchvision import transforms

torch.manual_seed(0)

train_loader = torch.utils.data.DataLoader(
        datasets.MNIST(dir, train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size= train_batch_size, shuffle=True)

test_loader = torch.utils.data.DataLoader(
        datasets.MNIST(dir, train=False,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=test_batch_size, shuffle=False)
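
To see what the two transforms do to a single pixel, here is a small sketch in plain Python. It assumes the documented behaviour of `ToTensor` (scaling 8-bit pixel values to the range [0, 1]) and `Normalize` (subtracting the mean and dividing by the standard deviation). The helper names are made up for illustration.

```python
MEAN, STD = 0.1307, 0.3081  # the constants passed to transforms.Normalize

def to_unit_range(pixel):
    """Scale an 8-bit pixel value (0..255) to the range [0, 1]."""
    return pixel / 255.0

def normalize(value, mean=MEAN, std=STD):
    """Standardize a [0, 1] value: subtract the mean, divide by the std."""
    return (value - mean) / std

black = normalize(to_unit_range(0))    # darkest possible pixel
white = normalize(to_unit_range(255))  # brightest possible pixel

print(f"black -> {black:.4f}, white -> {white:.4f}")
```

After normalization, background pixels sit slightly below zero and bright strokes at roughly 2.8, so the inputs are approximately centered instead of all being non-negative.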

We are now ready to train our network using the Orca PyTorch Estimator:

In [13]:
from zoo.orca.learn.pytorch import Estimator 
from zoo.orca.learn.trigger import EveryEpoch 

est = Estimator.from_torch(model=model, optimizer=adam, loss=criterion, metrics=metrics)
est.fit(data=train_loader, epochs=5, validation_data=test_loader, batch_size=train_batch_size, checkpoint_trigger=EveryEpoch())

...
2021-02-22 17:31:49 INFO  DistriOptimizer$:427 - [Epoch 5 58560/60160][Iteration 935][Wall Clock 85.759594242s] Trained 320.0 records in 0.083955904 seconds. Throughput is 3811.5247 records/second. Loss is 0.06152063. 
2021-02-22 17:31:49 INFO  DistriOptimizer$:427 - [Epoch 5 58880/60160][Iteration 936][Wall Clock 85.835908327s] Trained 320.0 records in 0.076314085 seconds. Throughput is 4193.1973 records/second. Loss is 0.032923408. 
2021-02-22 17:31:49 INFO  DistriOptimizer$:427 - [Epoch 5 59200/60160][Iteration 937][Wall Clock 85.913411948s] Trained 320.0 records in 0.077503621 seconds. Throughput is 4128.8394 records/second. Loss is 0.07016848. 
2021-02-22 17:31:49 INFO  DistriOptimizer$:427 - [Epoch 5 59520/60160][Iteration 938][Wall Clock 85.992577896s] Trained 320.0 records in 0.079165948 seconds. Throughput is 4042.1418 records/second. Loss is 0.03392674. 
2021-02-22 17:31:49 INFO  DistriOptimizer$:427 - [Epoch 5 59840/60160][Iteration 939][Wall Clock 86.071633508s] Trained

<zoo.orca.learn.pytorch.estimator.PyTorchSparkEstimator at 0x7fdc00eea790>
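
The `60160` in the log lines above is larger than the 60,000 training images. One plausible reading, stated here as an assumption rather than documented Analytics Zoo behaviour, is that the counter advances in whole batches of 320, so an epoch is counted as 188 full batches. The arithmetic can be checked with a short sketch:

```python
import math

num_train_samples = 60000  # size of the MNIST training set
batch_size = 320           # train_batch_size used in this notebook
epochs = 5

# 60000 / 320 = 187.5, so covering every sample takes 188 batches.
iterations_per_epoch = math.ceil(num_train_samples / batch_size)
records_per_epoch = iterations_per_epoch * batch_size
total_iterations = iterations_per_epoch * epochs

print(iterations_per_epoch, records_per_epoch, total_iterations)
```

Under this reading, the last logged line (iteration 939, at 59840 of 60160 records) is one step before the end of the fifth epoch.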

The "loss" of the network over the training data is displayed during training. The training accuracy can be obtained by evaluating the model on the training set:

In [14]:
train_result = est.evaluate(data=train_loader, batch_size=train_batch_size)
print(train_result)

[Stage 1897:>                                                       (0 + 1) / 1]2021-02-22 17:32:11 INFO  DistriOptimizer$:1759 - Top1Accuracy is Accuracy(correct: 59548, count: 60000, accuracy: 0.9924666666666667)
{'Top1Accuracy': 0.9924666881561279}


We can see that the training accuracy is 0.992 (i.e. 99.2%). Now let's check if our model performs well on the test set too:

In [15]:
test_result = est.evaluate(data=test_loader, batch_size=test_batch_size)
print(test_result)

[Stage 1899:>                                                       (0 + 1) / 1]2021-02-22 17:32:14 INFO  DistriOptimizer$:1759 - Top1Accuracy is Accuracy(correct: 9791, count: 10000, accuracy: 0.9791)
{'Top1Accuracy': 0.9790999889373779}



Our test set accuracy turns out to be 97.9% -- that's quite a bit lower than the training set accuracy. 
This gap between training accuracy and test accuracy is an example of "overfitting", 
the fact that machine learning models tend to perform worse on new data than on their training data. 
Overfitting will be a central topic in chapter 3.
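
The size of this gap is easier to judge in terms of error rates. The sketch below uses only the `correct` and `count` values printed in the two evaluation logs above; the variable names are chosen for illustration.

```python
# Counts copied from the evaluation logs above.
train_correct, train_count = 59548, 60000
test_correct, test_count = 9791, 10000

train_acc = train_correct / train_count
test_acc = test_correct / test_count

train_errors = train_count - train_correct  # misclassified training images
test_errors = test_count - test_correct     # misclassified test images

gap = train_acc - test_acc
error_ratio = (test_errors / test_count) / (train_errors / train_count)

print(f"train error rate: {train_errors / train_count:.4f}")
print(f"test error rate:  {test_errors / test_count:.4f}")
print(f"accuracy gap:     {gap:.4f}")
print(f"error ratio:      {error_ratio:.2f}")
```

An accuracy difference of about 1.3 percentage points looks small, but it means the error rate on unseen images is between two and three times the error rate on the training images.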

This concludes our very first example -- you just saw how we could build and train a neural network to classify handwritten digits. In the next chapter, we will go in detail over every moving piece we just previewed, and clarify what is really going on behind the scenes. You will learn about "tensors", the data-storing objects going into the network, about tensor operations, which layers are made of, and about gradient descent, which allows our network to learn from its training examples.

Note: you should call `stop_orca_context()` when the program finishes.

In [16]:
stop_orca_context()

Stopping orca context
