In [1]:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np

from scipy.special import logsumexp
from numpy import ndarray

from typing import Callable, List

# Import smartflow
from smartflow.operation import *
from smartflow.trainer import *
from smartflow.layer import *
from smartflow.loss import *
from smartflow.util import *
from smartflow.eval import *
from smartflow.nn import *

In [2]:
# Configure logging to print to the Jupyter notebook

import importlib as implib
import logging as log
import sys

implib.reload(log)
log.basicConfig(format="[smartflow] %(message)s", stream=sys.stdout, level=log.INFO)

# 05. CNN
In this chapter, we’ll cover convolutional neural networks (`CNN`). `CNN` are the standard neural network architecture used for prediction when the input observations are images.

In the book, we’ve focused exclusively on fully connected neural networks, which we implemented as a series of `Dense` layers. Thus, we’ll start this chapter by reviewing some key elements of these networks and use this to motivate why we might want to use a different architecture for images. We’ll then cover `CNN` in a manner similar to that in which we introduced other concepts in this book: we’ll first discuss how they work at a high level, then move to discussing them at a lower level, and finally show in detail how they work by coding up the convolution operation from scratch. By the end of this chapter, you’ll have a thorough enough understanding of how `CNN` work to be able to use them both to solve problems and to learn about advanced `CNN` variants, such as `ResNet`, `DenseNet`, and `Octave Convolution` on your own.

## 5.1 Neural Networks and Representation Learning
Neural networks initially receive data on observations, with each observation represented by some number $n$ of features. So far we’ve seen two examples of this in two very different domains: the first was the house prices dataset, where each observation was made up of $13$ features, each of which represented a numeric characteristic about that house. The second was the MNIST dataset of handwritten digits; since the images were represented with $784$ pixels, each observation was represented by $784$ values indicating the lightness or darkness of each pixel.

In each case, after appropriately scaling the data, we were able to build a model that predicted the appropriate outcome for that dataset with high accuracy. Also in each case, a simple neural network model with one hidden layer performed better than a model without that hidden layer. Why is that? One reason, as I showed in the case of the house prices data, is that the neural network could learn nonlinear relationships between input and output. However, a more general reason is that in machine learning, we often need linear combinations of our original features in order to effectively predict our target. 

Let’s say that the pixel values for an MNIST digit are $x_1$ through $x_{784}$. It could be the case, for example, that a combination of $x_1$ being higher than average, $x_{139}$ being lower than average, and $x_{237}$ also being lower than average strongly predicts that an image will be of digit $9$. There may be many other such combinations, all of which contribute positively or negatively to the probability that an image is of a particular digit. Neural networks can automatically discover combinations of the original features that are important through their training process. That process starts by creating initially random combinations of the original features via multiplication by a random weight matrix; through training, the neural network learns to refine combinations that are helpful and discard those that aren’t. This process of learning which combinations of features are important is known as `representation learning`, and it’s the main reason why neural networks are successful across different domains. This is summarized in `Figure 5-1`.

<img src="images/05_01.png" style="width:600px;"/>

Is there any reason to modify this process for image data? The fundamental insight that suggests the answer is “yes” is that in images, the interesting "combinations of features" (pixels) tend to come from pixels that are close together in the image. In an image, it is simply much less likely that an interesting feature will result from a combination of $9$ randomly selected pixels throughout the image than from a $3\times 3$ patch of adjacent pixels. We want to exploit this fundamental fact about image data: that the order of the features matters since it tells us which pixels are near each other spatially, whereas in the house prices data the order of the features doesn’t matter. But how do we do it?

### A Different Architecture for Image Data
The solution, at a high level, will be to create combinations of features, as before, but an order of magnitude more of them, and have each one be only a combination of the pixels from a small rectangular patch in the input image. `Figure 5-2` describes this.

<img src="images/05_02.png" style="width:600px;"/>

Having our neural network learn combinations of all of the input features, that is, combinations of all of the pixels in the input image, turns out to be very inefficient, since it ignores the insight described in the prior section: that most of the interesting combinations of features in images occur in these small patches. Nevertheless, previously it was at least extremely easy to compute new features that were combinations of all the input features: if we had $f$ input features and wanted to compute $n$ new features, we could simply multiply the `ndarray` containing our input features by an $f\times n$ matrix. What operation can we use to compute many combinations of the pixels from local patches of the input image? The answer is the *convolution operation*.

### The Convolution Operation
Before we describe the convolution operation, let’s make clear what we mean by "a feature that is a combination of pixels from a local patch of an image". Let’s say we have a $5 \times 5$ input image `I`:

$$I = \begin{bmatrix} i_{11} & i_{12} & i_{13} & i_{14} & i_{15} \\ i_{21} & i_{22} & i_{23} & i_{24} & i_{25} \\ i_{31} & i_{32} & i_{33} & i_{34} & i_{35} \\ i_{41} & i_{42} & i_{43} & i_{44} & i_{45} \\ i_{51} & i_{52} & i_{53} & i_{54} & i_{55} \end{bmatrix}$$

And let’s say we want to calculate a new feature that is a function of the $3 \times 3$ patch of pixels in the middle. Well, just as we’ve defined new features as linear combinations of old features in the neural networks we’ve seen so far, we’ll define a new feature that is a function of this $3 \times 3$ patch, which we’ll do by defining a $3 \times 3$ set of weights, $W$:

$$W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33}  \end{bmatrix}$$

Then we’ll simply take the dot product of $W$ with the relevant patch from $I$ to get the value of the feature in the output, which, since the section of the input image involved was centered at $(3,3)$, we’ll denote as $o_{33}$ (the $o$ stands for "output"):

$$o_{33} = w_{11}i_{22} + w_{12}i_{23} + w_{13}i_{24} + w_{21}i_{32} + w_{22}i_{33} + w_{23}i_{34} + w_{31}i_{42} + w_{32}i_{43} + w_{33}i_{44}$$

This value will then be treated like the other computed features we’ve seen in neural networks: it may have a bias added to it and then will probably be fed through an activation function, and then it will represent a "neuron" or "learned feature" that will get passed along to subsequent layers of the network. Thus we can define features that are functions of small patches of an input image.
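As a minimal sketch of the $o_{33}$ calculation, assuming NumPy and arbitrary illustrative values for `I` and `W` (neither set of values comes from the chapter), the feature is the elementwise product of the weights with the central patch, summed:

```python
import numpy as np

# Illustrative 5x5 "image" and 3x3 weights; the values are arbitrary assumptions.
I = np.arange(1, 26, dtype=float).reshape(5, 5)
W = np.array([[1., 0., -1.],
              [1., 0., -1.],
              [1., 0., -1.]])

# The 3x3 patch centered at (3, 3) in the 1-based indexing of the text
# is I[1:4, 1:4] in NumPy's 0-based indexing.
patch = I[1:4, 1:4]

# "Dot product" of W with the patch: multiply elementwise, then sum.
o_33 = np.sum(W * patch)
print(o_33)
```

Adding a bias and applying an activation function to `o_33` would then follow the same pattern as for any other learned feature.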

How should we interpret such features? It turns out that features computed in this way have a special interpretation: they represent whether a visual pattern defined by the weights is present at that location of the image. The fact that $3 \times 3$ or $5 \times 5$ arrays of numbers can represent "pattern detectors" when their dot product is taken with the pixel values at each location of an image has been well known in the field of computer vision for a long time. For example, taking the dot product of the following $3 \times 3$ array of numbers:

$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$

with a given section of an input image detects whether there is an edge at that location of the image. There are similar matrices known to be able to detect whether corners exist, whether vertical or horizontal lines exist, and so on.

Now suppose that we used the same set of weights $W$ to detect whether the visual pattern defined by $W$ existed at each location in the input image. We could imagine "sliding $W$ over the input image", taking the dot product of $W$ with the pixels at each location of the image, and ending up with a new image $O$ of almost identical size to the original image (it may be slightly different, depending on how we handle the edges). This image $O$ would be a kind of "feature map" showing the locations in the input image where the pattern defined by $W$ was present. This operation is in fact what happens in convolutional neural networks; it is called a *convolution*, and its output is indeed called a *feature map*.
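The sliding procedure can be sketched as follows. This is an illustration rather than the implementation developed later in the chapter: `slide_filter` is a hypothetical helper, the toy image is made up, and no padding is applied, so the feature map comes out slightly smaller than the input.

```python
import numpy as np

def slide_filter(image: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Slide `weights` over `image` without padding (hypothetical helper)."""
    k_h, k_w = weights.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Dot product of the weights with the patch whose top-left corner is (r, c).
            feature_map[r, c] = np.sum(weights * image[r:r + k_h, c:c + k_w])
    return feature_map

# The edge-detecting weights shown earlier in this section.
edge_filter = np.array([[0., 1., 0.],
                        [1., -4., 1.],
                        [0., 1., 0.]])

# Toy 5x5 image (an assumption for illustration): dark on the left, bright on the right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

feature_map = slide_filter(image, edge_filter)
print(feature_map)
```

For this toy image the feature map is zero away from the dark/bright boundary and nonzero in the two columns next to it, which is the sense in which the filter "detects" the edge.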

This operation is at the core of how `CNN` work. Before we can incorporate it into a full-fledged `Operation`, of the kind we’ve seen in the prior chapters, we have to add another dimension to it—literally.

### The Multichannel Convolution Operation
To review: `CNN` differ from regular neural networks in that they create an order of magnitude more features, and in that each feature is a function of just a small patch from the input image. Now we can get more specific: starting with $n$ input pixels, the convolution operation just described will create $n$ output features, one for each location in the input image. What actually happens in a convolutional `Layer` in a neural network goes one step further: there, we’ll create $f$ sets of $n$ features, each with a corresponding (initially random) set of weights defining a visual pattern whose detection at each location in the input image will be captured in the feature map. These $f$ feature maps will be created via $f$ convolution operations. This is captured in `Figure 5-3`.

<img src="images/05_03.png" style="width:600px;"/>

Now that we’ve introduced a bunch of concepts, let’s define them for clarity. Each "set of features" detected by a particular set of weights is called a **feature map**. In the context of a **convolutional Layer**, the number of feature maps is referred to as the number of **channels** of the Layer; this is why the operation involved with the `Layer` is called the **multichannel convolution**. In addition, the $f$ sets of weights $W_i$ are called the **convolutional filters**.

## 5.2 Convolutional Layers
Now that we understand the multichannel convolution operation, we can think about how to incorporate this operation into a neural network layer. Previously, our neural network layers were relatively straightforward: they received two-dimensional `ndarrays` as input and produced two-dimensional `ndarrays` as output. Based on the description in the prior section, however, convolutional layers will have a 3D `ndarray` as output for a single image, with dimensions $\textit{number of channels}\times\textit{image height}\times\textit{image width}$.

This raises a question: how can we feed this `ndarray` forward into another convolutional layer to create a "deep convolutional" neural network? We’ve seen how to perform the convolution operation on an image with a single channel and our filters; how can we perform the multichannel convolution on an input with multiple channels, as we’ll have to do when two convolutional layers are strung together? Understanding this is the key to understanding deep convolutional neural networks.

Consider what happens in a neural network with fully connected layers: in the first hidden layer, we have, let’s say, $h_1$ features that are combinations of all of the original features from the input layer. In the layer that follows, the features are combinations of all of the features from the prior layer, so that we might have $h_2$ "features of features" of the original features. To create this next layer of $h_2$ features, we use $h_1 \times h_2$  weights to represent that each of the $h_2$ features is a function of each of the $h_1$ features in the prior layer.

As described in the prior section, an analogous process happens in the first layer of a convolutional neural network: we first transform the input image into $m_1$ feature maps, using $m_1$ convolutional filters. We should think of the output of this layer as representing whether each of the $m_1$ different visual patterns represented by the weights of the $m_1$ filters is present at each location in the input image. Just as different layers of a fully connected neural network can contain different numbers of neurons, the next layer of the convolutional neural network could contain $m_2$ filters. In order for the network to learn complex patterns, the interpretation of each of these should be whether each of the "patterns of patterns" or higher-order visual features represented by combinations of the $m_1$ visual patterns from the prior layer was present at that location of the image. This implies that if the output of the convolutional layer is a 3D `ndarray` of shape $m_2 \ \textit{channels}\times\textit{image height}\times\textit{image width}$, then a given location in the image on one of the $m_2$ feature maps is a linear combination of convolving $m_1$ different filters over that same location in each of the corresponding $m_1$ feature maps from the prior layer. This will allow each location in each of the $m_2$ feature maps to represent a combination of the $m_1$ visual features already learned in the prior convolutional layer.

### Implementation Implications
This understanding of how two multichannel convolutional layers are connected tells us how to implement the operation: just as we need $h_1\times h_2$ weights to connect a fully connected layer with $h_1$ neurons to one with $h_2$, we need $m_1\times m_2$ convolutional filters to connect a convolutional layer with $m_1$ channels to one with $m_2$. With this last detail in place, we can now specify the dimensions of the `ndarrays` that will make up the input, output, and parameters of the full, multichannel convolution operation:

1. The input will have shape:
    + Batch size
    + Input channels
    + Image height
    + Image width

2. The output will have shape:
    + Batch size
    + Output channels
    + Image height
    + Image width

3. The convolutional filters themselves will have shape:
    + Input channels
    + Output channels
    + Filter height
    + Filter width

> The order of the dimensions may vary from library to library, but these four dimensions will always be present.
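To make these shapes concrete, here is a deliberately naive, loop-based sketch. It is not the implementation built later in the chapter: `naive_multichannel_conv` is a hypothetical name, and it assumes odd filter sizes, zero padding, and a stride of 1.

```python
import numpy as np

def naive_multichannel_conv(inp: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Loop-based multichannel convolution (hypothetical sketch).

    inp:     (batch_size, in_channels, height, width)
    filters: (in_channels, out_channels, filter_height, filter_width)
    returns: (batch_size, out_channels, height, width)
    """
    batch_size, in_channels, height, width = inp.shape
    f_in, out_channels, f_h, f_w = filters.shape
    assert f_in == in_channels, "filters must match the input channels"

    # Zero-pad only the two spatial dimensions so the output keeps its size.
    pad_h, pad_w = f_h // 2, f_w // 2
    padded = np.pad(inp, ((0, 0), (0, 0), (pad_h, pad_h), (pad_w, pad_w)))

    out = np.zeros((batch_size, out_channels, height, width))
    for b in range(batch_size):
        for c_out in range(out_channels):
            for r in range(height):
                for c in range(width):
                    # The patch spans every input channel: (in_channels, f_h, f_w).
                    patch = padded[b, :, r:r + f_h, c:c + f_w]
                    # One filter per input channel feeds this output channel.
                    out[b, c_out, r, c] = np.sum(patch * filters[:, c_out])
    return out

rng = np.random.default_rng(0)
images = rng.normal(size=(2, 3, 8, 8))    # batch of 2 inputs with 3 channels
filters = rng.normal(size=(3, 4, 3, 3))   # 3 input channels -> 4 output channels
print(naive_multichannel_conv(images, filters).shape)
```

Each output channel sums the contributions of all input channels, which is why the parameter array needs one filter for every (input channel, output channel) pair.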

We’ll keep all of this in mind when we implement this convolution operation later in the chapter.

### The Differences Between Convolutional and Fully Connected Layers
At the beginning of the chapter, we discussed the differences between convolutional and fully connected layers at a high level; `Figure 5-4` revisits that comparison, now that we’ve described convolutional layers in more detail.

<img src="images/05_04.png" style="width:600px;"/>

In addition, one last difference between the two kinds of layers is the way in which the individual neurons themselves are interpreted:
+ The interpretation of each neuron of a fully connected layer is that it detects whether or not a particular combination of the features learned by the prior layer is present in the current observation.

+ The interpretation of a neuron of a convolutional layer is that it detects whether or not a particular combination of visual patterns learned by the prior layer is present at the given location of the input image.

There’s one more problem we need to solve before we can incorporate such a layer into a neural network: how to use the multidimensional `ndarrays` we obtain as output to make predictions.

### Making Predictions with Convolutional Layers: The Flatten Layer
We’ve covered how convolutional layers learn features that represent whether visual patterns exist in images and store those features in layers of feature maps; how do we use these layers of feature maps to make predictions? When using fully connected neural networks to predict which of $10$ classes an image belonged to in the prior chapter, we just had to ensure that the last layer had dimension $10$; we could then feed these $10$ numbers into the softmax cross entropy loss function to ensure they were interpreted as probabilities. Now we need to figure out what we can do in the case of our convolutional layer, where we have a three-dimensional `ndarray` per observation of shape $m \ \textit{channels}\times\textit{image height}\times\textit{image width}$.

To see the answer, recall that each neuron simply represents whether a particular combination of visual features (which, if this is a deep convolutional neural network, could be a feature of features or a feature of features of features) is present at a given location in the image. This is no different from the features that would be learned if we applied a fully connected neural network to this image: the first fully connected layer would represent features of the individual pixels, the second would represent features of these features, and so on. And in a fully connected architecture, we would simply treat each "feature of features" that the network had learned as a single neuron that would be used as input to a prediction of which class the image belonged to.

It turns out that we can do the same thing with `CNN`. We treat the $m$ feature maps as $m\times\textit{image}_{height}\times\textit{image}_{width}$ neurons and use a `Flatten` operation to squash these three dimensions (the number of channels, the image height, and the image width) down into a one-dimensional vector, after which we can use a simple matrix multiplication to make our final predictions. The intuition for why this works is that each individual neuron fundamentally represents the same "kind of thing" as the neurons in a fully connected layer - specifically, whether a given visual feature (or combination of features) is present at a given location in an image - and thus we can treat them the same way in the final layer of the neural network.
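A minimal sketch of this idea, using random arrays as stand-ins for a convolutional layer's output and for the final weight matrix (all sizes here are illustrative assumptions, and this is not the `Flatten` layer implemented later):

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, channels, height, width = 2, 4, 5, 5
num_classes = 10

# Stand-in for the output of a convolutional layer (random values).
feature_maps = rng.normal(size=(batch_size, channels, height, width))

# Flatten: keep the batch dimension, squash the remaining three into one.
flat = feature_maps.reshape(batch_size, -1)

# A single matrix multiplication then maps the neurons to class scores.
weights = rng.normal(size=(channels * height * width, num_classes))
scores = flat @ weights

print(flat.shape, scores.shape)
```

Here each observation becomes a vector of $4 \times 5 \times 5 = 100$ neurons, and the result has one score per class for each observation in the batch.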

We’ll see how to implement the `Flatten` layer later in the chapter. But before we dive into the implementation, let’s discuss another kind of layer that is important in many `CNN` architectures, though we won’t cover it in great detail in this book.

### Pooling Layers
Pooling layers are another kind of layer commonly used in `CNN`. They simply downsample each of the feature maps created by a convolution operation; for the most typically used pooling size of $2$, this involves mapping each $2 \times 2$ section of each feature map either to the maximum value of that section, in the case of `max-pooling`, or to the average value of that section, in the case of `average-pooling`. For an $n\times n$ image, then, this would map the entire image to one of size $\frac{n}{2}\times\frac{n}{2}$. `Figure 5-5` illustrates this.

<img src="images/05_05.png" style="width:600px;"/>
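Although pooling layers are not implemented in this chapter, a small NumPy sketch may help make the downsampling concrete. `pool_2x2` is a hypothetical helper that handles a single feature map with even height and width:

```python
import numpy as np

def pool_2x2(feature_map: np.ndarray, mode: str = "max") -> np.ndarray:
    """Downsample one 2D feature map with non-overlapping 2x2 blocks (sketch)."""
    h, w = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    # blocks[i, a, j, b] == feature_map[2 * i + a, 2 * j + b]
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fm = np.array([[1., 2., 5., 6.],
               [3., 4., 7., 8.],
               [9., 10., 13., 14.],
               [11., 12., 15., 16.]])

print(pool_2x2(fm, "max"))   # maxima of the four blocks: 4, 8, 12, 16
print(pool_2x2(fm, "mean"))  # averages of the four blocks: 2.5, 6.5, 10.5, 14.5
```

The $4 \times 4$ feature map is reduced to $2 \times 2$, so only one-fourth as many values are passed forward.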

The main advantage of pooling is computational: by downsampling the image to contain one-fourth as many pixels as the prior layer, pooling decreases both the number of weights and the number of computations needed to train the network by a factor of 4; this can be further compounded if multiple pooling layers are used in the network, as they were in many architectures in the early days of `CNN`. The downside of pooling, of course, is that only one-fourth as much information can be extracted from the downsampled image. However, the fact that architectures showed very strong performance on benchmarks in image recognition despite the use of pooling suggested that, even though pooling was causing the networks to "lose information" about the images by decreasing the images’ resolution, the trade-offs in terms of increased computational speed were worth it. Nevertheless, pooling was considered by many to be a trick that just happened to work but should probably be done away with; as `Geoffrey Hinton` wrote on a Reddit AMA in 2014, "The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster". And indeed, most recent `CNN` architectures (such as `ResNet`) use pooling minimally or not at all. Thus, in this book, we’re not going to implement pooling layers, but given their importance for "putting CNNs on the map" via their use in famous architectures such as `AlexNet`, we mention them here for completeness.

##### Applying CNN beyond images
Everything we have described so far is extremely standard for dealing with images using neural networks: the images are typically represented as a set of $m_1$ channels of pixels, where $m_1 = 1$ for black-and-white images and $m_1 = 3$ for color images, and then some number $m_2$ of convolution operations are applied to each channel (using the $m_1 \times m_2$ filters as explained previously), with this pattern continuing on for several layers. This has all been covered in other treatments of convolutional neural networks; what is less commonly covered is that the idea of organizing data into "channels" and then processing that data using a CNN goes beyond just images. For example, this data representation was a key to `DeepMind`’s series of AlphaGo programs showing that neural networks could learn to play `Go` (`DeepMind (David Silver et al.), Mastering the Game of Go Without Human Knowledge, 2017`).

## 5.3 Implementing the Multichannel Convolution Operation
It turns out that implementing this daunting operation - which involves a four-dimensional input `ndarray` and a four-dimensional parameter `ndarray` - is much clearer if we first examine the one-dimensional case. Building up to the full operation from that starting point will turn out mostly to be a matter of adding a bunch of for loops. 

### The Forward Pass

The convolution in one dimension is conceptually identical to the convolution in two dimensions: we take in a one-dimensional input and a one-dimensional convolutional filter as inputs and then create the output by sliding the filter along the input.

Let’s suppose our input is of length 5:

$$\textit{Input} = [t_1, t_2, t_3, t_4, t_5]$$

And let’s say the size of the "patterns" we want to detect is length 3:

$$\textit{Filter} = [w_1, w_2, w_3]$$

The first element of the output would be created by convolving the first element of the input with the filter:

$$\textit{Output Feature 1}: O_1 = t_1w_1+t_2w_2+t_3w_3$$

The second element of the output would be created by sliding the filter one unit to the right and convolving it with the next set of values of the series:

$$\textit{Output Feature 2}: O_2 = t_2w_1+t_3w_2+t_4w_3$$

Fair enough. However, when we compute the next output value, we realize that we have run out of room:

$$\textit{Output Feature 3}: O_3 = t_3w_1+t_4w_2+t_5w_3$$

We have hit the end of our input, and the resulting output has just three elements, when we started with five! How can we address this?

##### Padding
To avoid the output shrinking as a result of the convolution operation, we’ll introduce a trick used throughout `CNN`: we "pad" the input with zeros around the edges, enough so that the output remains the same size as the input. Otherwise, every time we convolve a filter over the input, we’ll end up with an output that is slightly smaller than the input, as seen previously.

As you can reason from the preceding convolution example: for a filter of size 3, there should be one unit of padding around the edges to keep the output the same size as the input. More generally, since we almost always use odd-numbered filter sizes, we add padding equal to the filter size divided by 2 and rounded down to the nearest integer.

Coding up this part turns out to be pretty straightforward. Before we do, let’s summarize the steps we just discussed:
1. We ultimately want to produce an output that is the same size as the input.
2. To do this without “shrinking” the output, we’ll first need to pad the input.
3. Then we’ll have to write some sort of loop that goes through the input and convolves each position of it with the filter.

We’ll start with our input and our filter:

In [3]:
input_1d = np.array([1,2,3,4,5])
param_1d = np.array([1,1,1])

Here’s a helper function that can pad our one-dimensional input on each end:

In [4]:
def _pad_1d(inp: np.ndarray, num: int) -> np.ndarray:
    z = np.array([0])
    z = np.repeat(z, num) 
    return np.concatenate([z, inp, z])

_pad_1d(input_1d, 1)

array([0, 1, 2, 3, 4, 5, 0])

What about the convolution itself? Observe that for each element in the output that we want to produce, we have a corresponding element in the padded input where we "start" the convolution operation; once we figure out where to start, we simply loop through all the elements in the filter, doing a multiplication at each element and adding the result to the total.

How do we find this "corresponding element"? Note that, simply, the value at the first element in the output gets its value starting at the first element of the padded input! This makes the for loop quite easy to write:

In [5]:
def conv_1d(inp: np.ndarray, param: np.ndarray) -> np.ndarray:
    # assert correct dimensions (both arrays must be one-dimensional)
    assert inp.ndim == 1
    assert param.ndim == 1
    # pad the input
    param_len = param.shape[0]
    param_mid = param_len // 2 
    input_pad = _pad_1d(inp, param_mid)
    # initialize the output 
    out = np.zeros(inp.shape)
    # perform the 1d convolution 
    for o in range(out.shape[0]):
        for p in range(param_len):
            out[o] += param[p] * input_pad[o+p]
    # ensure shapes didn't change 
    assert_same_shape(inp, out)
    return out

conv_1d(input_1d, param_1d)

array([ 3.,  6.,  9., 12.,  9.])
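As a sanity check, the same result can be reproduced with NumPy's built-in `np.correlate`. The sketch below re-implements the padded loop under the hypothetical name `padded_conv_1d` so that it is self-contained, and uses an asymmetric filter (an illustrative choice) to show that the filter is not flipped:

```python
import numpy as np

def padded_conv_1d(inp: np.ndarray, param: np.ndarray) -> np.ndarray:
    """Same padded loop as in the text, restated here for self-containment."""
    pad = param.shape[0] // 2
    padded = np.pad(inp, pad)            # zero padding on both ends
    out = np.zeros(inp.shape)
    for o in range(out.shape[0]):
        for p in range(param.shape[0]):
            out[o] += param[p] * padded[o + p]
    return out

inp = np.array([1, 2, 3, 4, 5])
param = np.array([1, 2, 3])              # asymmetric, so flipping would matter

looped = padded_conv_1d(inp, param)
builtin = np.correlate(inp, param, mode="same")
print(looped, builtin)
```

The two agree because the operation used in `CNN` is, strictly speaking, a cross-correlation; `np.convolve` would reverse the filter before sliding it.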

That’s simple enough. Before we move on to the backward pass of this operation, let’s briefly discuss a hyperparameter of convolutions that we’re glossing over: `stride`.

##### A note on stride
We noted earlier that pooling operations were one way to downsample images from feature maps. In many early convolutional architectures, these did indeed significantly reduce the amount of computation needed without any significant hit to accuracy; nevertheless, they’ve fallen out of favor because of their downside: they effectively downsample the image so that an image with just half the resolution is passed forward into the next layer.

A much more widely accepted way to do this is to modify the stride of the convolution operation. The stride is the amount that the filter is incrementally slid over the image. In the previous case, we are using a stride of $1$, and as a result each filter is convolved with every element of the input, which is why the output ends up being the same size as the input. With a stride of $2$, the filter would be convolved with every other element of the input image, so that the output would be half the size of the input; with a stride of $3$, the filter would be convolved with every third element of the input image, and so on. This means that, for example, using a stride of $2$ would result in the same output size and thus much the same reduction in computation we would get from pooling with size $2$, but without as much loss of information: with pooling of size 2, only one-fourth of the elements in the input have any effect on the output, whereas with a stride of $2$, every element of the input has some effect on the output. The use of a stride of greater than $1$ is thus significantly more prevalent than pooling for downsampling even in the most advanced `CNN` architectures of today.
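The effect of the stride on the output size can be sketched with a small variant of the one-dimensional convolution. `conv_1d_strided` is a hypothetical function written for this illustration; it keeps the same zero-padding rule as before and simply moves the filter `stride` positions at a time.

```python
import numpy as np

def conv_1d_strided(inp: np.ndarray, param: np.ndarray, stride: int = 1) -> np.ndarray:
    """1D convolution with zero padding and a configurable stride (sketch)."""
    param_len = param.shape[0]
    padded = np.pad(inp, param_len // 2)
    # Number of filter positions that fit when stepping `stride` at a time.
    out_len = (padded.shape[0] - param_len) // stride + 1
    out = np.zeros(out_len)
    for o in range(out_len):
        start = o * stride
        out[o] = np.sum(param * padded[start:start + param_len])
    return out

inp = np.array([1, 2, 3, 4, 5])
param = np.array([1, 1, 1])

print(conv_1d_strided(inp, param, stride=1))  # five outputs: 3, 6, 9, 12, 9
print(conv_1d_strided(inp, param, stride=2))  # three outputs: 3, 9, 9
```

With a stride of $2$, the output is every other element of the stride-$1$ output, yet each input element still appears in at least one of the patches.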

Nevertheless, in this book I’ll just show examples with a stride of $1$, and modifying these operations to allow a stride of greater than $1$ is left as an exercise for the reader. Using a stride equal to $1$ also makes writing the backward pass easier.

### Convolutions: The Backward Pass
The backward pass is where convolutions get a bit trickier. Let’s recall what we’re trying to do: before, we produced the output of a convolution operation using the input and the parameters. We now want to compute:
+ The partial derivative of the loss with respect to each element of the input to the convolution operation - `inp` previously
+ The partial derivative of the loss with respect to each element of the *filter* - `param` previously

Think of how the `ParamOperations` we saw in `Chapter 4` work: in the backward method, they receive an output gradient representing how much each element of the output ultimately affects the loss and then use this output gradient to compute the gradients for the input and the parameters. So we need to write a function that takes in an `output_grad` with the same shape as the input and produces an `input_grad` and a `param_grad`.

How can we test whether the computed gradients are correct? We’ll bring back an idea from the first chapter: we know that the partial derivative of a sum with respect to any one of its inputs is $1$ (if the sum $s = a+b+c$, then $\frac{\partial s}{\partial a} = \frac{\partial s}{\partial b} = \frac{\partial s}{\partial c} = 1$). So we can compute the `input_grad` and `param_grad` quantities using our `_input_grad` and `_param_grad` functions (which we’ll reason through and write shortly) and an `output_grad` equal to all $1$s. Then we’ll check whether these gradients are correct by changing elements of the input by some quantity $\alpha$ and seeing whether the resulting sum changes by the gradient times $\alpha$.

##### What "should" the gradient be?
Using the logic just described, let’s calculate what an element of the gradient vector for the input should be:

In [6]:
def conv_1d_sum(inp: np.ndarray, param: np.ndarray) -> float:
    out = conv_1d(inp, param)
    return np.sum(out)

# randomly choose to increase 5th element by 1
input_1d = np.array([1,2,3,4,5])
input_1d_2 = np.array([1,2,3,4,6])
param_1d = np.array([1,1,1])

In [7]:
conv_1d_sum(input_1d, param_1d), conv_1d_sum(input_1d_2, param_1d)

(39.0, 41.0)

So, the gradient of the fifth element of the input should be $41 - 39 = 2$.
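The same check can be repeated for every element of the input. The following is a sketch under stated assumptions: `conv_1d_sum_sketch` restates the padded convolution followed by a sum so the block is self-contained, and `numerical_input_grad` is a hypothetical helper that bumps one element at a time by $\alpha$.

```python
import numpy as np

def conv_1d_sum_sketch(inp: np.ndarray, param: np.ndarray) -> float:
    """Padded 1D convolution followed by a sum (restated for self-containment)."""
    param_len = param.shape[0]
    padded = np.pad(inp, param_len // 2)
    total = 0.0
    for o in range(inp.shape[0]):
        total += np.sum(param * padded[o:o + param_len])
    return total

def numerical_input_grad(inp: np.ndarray, param: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Estimate d(sum)/d(inp[i]) by bumping each element by alpha in turn."""
    base = conv_1d_sum_sketch(inp, param)
    grad = np.zeros(inp.shape[0])
    for i in range(inp.shape[0]):
        bumped = inp.astype(float)       # astype returns a fresh copy
        bumped[i] += alpha
        grad[i] = (conv_1d_sum_sketch(bumped, param) - base) / alpha
    return grad

input_1d = np.array([1, 2, 3, 4, 5])
param_1d = np.array([1, 1, 1])
print(numerical_input_grad(input_1d, param_1d))
```

For this filter the estimated gradient is $2$ at both ends of the input and $3$ in the interior, because the edge elements are covered by only two filter positions instead of three.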

Now let’s try to reason through how we should compute such a gradient without simply computing the difference between these two sums. Here is where things get interesting.

##### Computing the gradient of a 1D convolution
We see that increasing this element of the input increased the output by $2$. Taking a close look at the output shows exactly how it does this:

<img src="images/a_05_01.png" style="width:200px;"/>

This particular element of the input is denoted $t_5$. It appears in the output in two places:
+ As part of $o_4$, it is multiplied by $w_3$.

+ As part of $o_5$, it is multiplied by $w_2$.

To help see the general pattern of how inputs map to the sum of outputs, note that if there was an $o_6$ present, $t_5$ would also contribute to the output through being multiplied by $w_1$.

Therefore, the amount that $t_5$ ultimately affects the loss, which we can denote as $\displaystyle\frac{\partial L}{\partial t_5}$ will be:

$$\frac{\partial L}{\partial t_5} = \frac{\partial L}{\partial o_4}\times w_3 + \frac{\partial L}{\partial o_5}\times w_2 + \frac{\partial L}{\partial o_6}\times w_1$$

Of course, in this simple example, when the loss is just the sum, $\displaystyle\frac{\partial L}{\partial o_i}=1$ for all elements in the output (except for the "padding" elements, for which this quantity is 0). This sum is very easy to compute: it is simply $w_2 + w_3$, which is indeed $2$ since $w_2 = w_3 = 1$.

##### What’s the general pattern?
Now let’s look for the general pattern for a generic input element. This turns out to be an exercise in keeping track of indices. Since we’re translating math into code here, let’s use $o_{i}^{grad}$ to denote the $i$-th element of the output gradient (since we’ll ultimately be accessing it via `output_grad[i]`). Then:

$$\frac{\partial L}{\partial t_5} = o_{4}^{grad}\times w_3 + o_{5}^{grad}\times w_2 + o_{6}^{grad}\times w_1$$

Looking closely at this output, we can reason similarly that:

$$\frac{\partial L}{\partial t_3} = o_{2}^{grad}\times w_3 + o_{3}^{grad}\times w_2 + o_{4}^{grad}\times w_1$$

and

$$\frac{\partial L}{\partial t_4} = o_{3}^{grad}\times w_3 + o_{4}^{grad}\times w_2 + o_{5}^{grad}\times w_1$$

There’s clearly a pattern here, and translating it into code is a bit tricky, especially since the indices on the output increase at the same time the indices on the weights decrease. Nevertheless, the way to express this turns out to be via the following double for loop:

```python
# param: in our case an ndarray of shape (3,)
# param_len: the integer 3
# inp: in our case an ndarray of shape (5,)
# input_grad: always an ndarray the same shape as "inp"
# output_pad: in our case an ndarray of shape (7,)
for o in range(inp.shape[0]):
    for p in range(param.shape[0]):
        input_grad[o] += output_pad[o+param_len-p-1] * param[p]
```

This increments the indices on the weights while decreasing the indices on the output at the same time.

Though it may not be obvious now, reasoning through this and getting it right turns out to be the trickiest part of calculating the gradients for convolution operations. Adding more complexity to this, such as batch sizes, convolutions with two-dimensional inputs, or inputs with multiple channels, is simply a matter of adding more for loops to the preceding lines, as we’ll see in the next few sections.
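One way to make the index pattern less mysterious is to notice that the double loop slides the *reversed* filter over the padded output gradient. The sketch below is an illustration rather than part of the chapter's code: it uses an asymmetric filter and an arbitrary output gradient (both made up for the example) and compares the loop against a flipped-filter dot product and against `np.convolve` with `mode="same"`.

```python
import numpy as np

param = np.array([1.0, 2.0, 3.0])                   # asymmetric, so flips matter
output_grad = np.array([1.0, -1.0, 2.0, 0.5, 3.0])  # arbitrary illustrative values
param_len = len(param)
output_pad = np.pad(output_grad, param_len // 2)    # zero-pad both ends

# Double loop from the text
input_grad = np.zeros_like(output_grad)
for o in range(len(output_grad)):
    for p in range(param_len):
        input_grad[o] += output_pad[o + param_len - p - 1] * param[p]

# Same result: slide the reversed filter over the padded output gradient
flipped = param[::-1]
via_flip = np.array(
    [np.dot(output_pad[o:o + param_len], flipped) for o in range(len(output_grad))]
)

# np.convolve flips its second argument internally, which is the same thing
via_convolve = np.convolve(output_grad, param, mode="same")

print(np.allclose(input_grad, via_flip), np.allclose(input_grad, via_convolve))
```

With the symmetric all-ones filter used so far, the flip is invisible, which is why an asymmetric filter is used here.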

##### Computing the parameter gradient
We can reason similarly about how increasing an element of the filter should increase the output. First, let’s increase (arbitrarily) the first element of the filter by one unit and observe the resulting impact on the sum:

In [8]:
input_1d = np.array([1,2,3,4,5])
# randomly choose to increase first element by 1
# param_1d = np.array([1,1,1])
param_1d_2 = np.array([2,1,1])

conv_1d_sum(input_1d, param_1d), conv_1d_sum(input_1d, param_1d_2)

(39.0, 49.0)

So we should find that $\displaystyle\frac{\partial L}{\partial w_1}=10$.

Just as we did for the input, by closely examining the output and seeing which elements of the filter affect it, as well as padding the input to more clearly see the pattern, we see that:

$$w_{1}^{grad} = t_0\times o_{1}^{grad} + t_1\times o_{2}^{grad} + t_2\times o_{3}^{grad} + t_3\times o_{4}^{grad} + t_4\times o_{5}^{grad}$$

And since, for the sum, all of the $o_{i}^{grad}$ elements are just $1$, and $t_0$ is $0$, we have:

$$w_{1}^{grad} = t_1 + t_2 + t_3 + t_4 = 1 + 2 + 3 + 4 = 10$$

This confirms the calculation from earlier.

Coding this up turns out to be easier than writing the code for the input gradient, since this time "the indices are moving in the same direction". Within the same nested for loop, the code is:

```python
# param: in our case an ndarray of shape (3,)
# param_grad: an ndarray the same shape as param
# inp: in our case an ndarray of shape (5,)
# input_pad: an ndarray of shape (7,)
# output_grad: in our case an ndarray of shape (5,)
for o in range(inp.shape[0]):
    for p in range(param.shape[0]):
        param_grad[p] += input_pad[o+p] * output_grad[o]
```
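Because the indices move in the same direction, each `param_grad[p]` is just a dot product between a contiguous window of the padded input and the output gradient. A minimal sketch of this, assuming `np.pad` for the zero padding and a loss equal to the plain sum of the outputs:

```python
import numpy as np

inp = np.array([1, 2, 3, 4, 5])
param = np.array([1, 1, 1])
output_grad = np.ones_like(inp)            # loss is the plain sum of the outputs
input_pad = np.pad(inp, len(param) // 2)   # [0, 1, 2, 3, 4, 5, 0]

# Double loop from the text
param_grad = np.zeros_like(param)
for o in range(inp.shape[0]):
    for p in range(param.shape[0]):
        param_grad[p] += input_pad[o + p] * output_grad[o]

# Same computation as one dot product per filter element
param_grad_sliced = np.array(
    [np.dot(input_pad[p:p + len(inp)], output_grad) for p in range(len(param))]
)

print(param_grad, param_grad_sliced)  # [10 15 14] [10 15 14]
```

The first entry reproduces the value $\displaystyle\frac{\partial L}{\partial w_1}=10$ found by bumping the filter.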

Finally, we can combine these two computations and write a function to compute both the input gradient and the filter gradient with the following steps:
1. Take the input and filter as arguments.

2. Compute the output.

3. Pad the input and the output gradient (to get, say, `input_pad` and `output_pad`).

4. As shown earlier, use the padded output gradient and the filter to compute the input gradient.

5. Similarly, use the output gradient (not padded) and the padded input to compute the filter gradient.

That concludes our explanation of how to implement convolutions in 1D! As we’ll see in the next several sections, extending this reasoning to work on two-dimensional inputs, batches of two-dimensional inputs, or even multichannel batches of two-dimensional inputs is (perhaps surprisingly) straightforward.

In [9]:
def _param_grad_1d(inp:np.ndarray, param:np.ndarray, output_grad:np.ndarray = None) -> np.ndarray:
    param_len = param.shape[0]
    param_mid = param_len // 2
    input_pad = _pad_1d(inp, param_mid)

    if output_grad is None:
        output_grad = np.ones_like(inp)
    else:
        assert_same_shape(inp, output_grad)
    # Zero padded 1 dimensional convolution
    param_grad = np.zeros_like(param)
    input_grad = np.zeros_like(inp)
    
    for o in range(inp.shape[0]):
        for p in range(param.shape[0]):
            param_grad[p] += input_pad[o+p] * output_grad[o]
    assert_same_shape(param_grad, param)
    return param_grad

def _input_grad_1d(inp:np.ndarray, param:np.ndarray, output_grad:np.ndarray = None) -> np.ndarray:
    param_len = param.shape[0]
    param_mid = param_len // 2
    inp_pad = _pad_1d(inp, param_mid)
    
    if output_grad is None:
        output_grad = np.ones_like(inp)
    else:
        assert_same_shape(inp, output_grad)
    
    output_pad = _pad_1d(output_grad, param_mid)
    # Zero padded 1 dimensional convolution
    param_grad = np.zeros_like(param)
    input_grad = np.zeros_like(inp)

    for o in range(inp.shape[0]):
        for f in range(param.shape[0]):
            input_grad[o] += output_pad[o+param_len-f-1] * param[f]
    assert_same_shape(input_grad, inp)
    return input_grad

In [10]:
input_1d = np.array([1,2,3,4,5])
param_1d = np.array([1,1,1])
_param_grad_1d(input_1d, param_1d), _input_grad_1d(input_1d, param_1d)

(array([10, 15, 14]), array([2, 3, 3, 3, 2]))

### Batches, 2D Convolutions, and Multiple Channels
First, let’s add the capability for these convolution functions to work with batches of inputs: 2D inputs whose first dimension represents the batch size of the input and whose second dimension represents the length of the 1D sequence:

```python
input_1d_batch = np.array([[0,1,2,3,4,5,6], [1,2,3,4,5,6,7]])
```

We can follow the same general steps defined before: we’ll first pad the input, use this to compute the output, and then pad the output gradient to compute both the input and filter gradients.

##### 1D convolutions with batches: forward pass
The only difference in implementing the forward pass when the input has a second dimension representing the batch size is that we have to pad and compute the output for each observation individually (as we did previously) and then stack the results to get a batch of outputs. For example, `conv_1d` becomes:

In [11]:
def conv_1d_batch(inp: ndarray, param: ndarray) -> ndarray:
    outs = [conv_1d(obs, param) for obs in inp] 
    return np.stack(outs)

input_1d_batch = np.array([[0,1,2,3,4,5,6], [1,2,3,4,5,6,7]])
param_1d = np.array([1,1,1])
conv_1d_batch(input_1d_batch, param_1d)

array([[ 1.,  3.,  6.,  9., 12., 15., 11.],
       [ 3.,  6.,  9., 12., 15., 18., 13.]])
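As a cross-check on this output, the zero-padded operation used in this chapter coincides with NumPy's cross-correlation in `"same"` mode for odd-length filters. The sketch below relies on that correspondence and is not how `conv_1d` itself is written:

```python
import numpy as np

input_1d_batch = np.array([[0, 1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6, 7]])
param_1d = np.array([1, 1, 1])

# mode="same" zero-pads so that each output row is as long as its input row
out = np.stack([np.correlate(obs, param_1d, mode="same") for obs in input_1d_batch])
print(out)
# [[ 1  3  6  9 12 15 11]
#  [ 3  6  9 12 15 18 13]]
```

Note that `np.correlate` is used rather than `np.convolve`, since the forward pass here does not flip the filter.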

##### 1D convolutions with batches: backward pass
The backward pass is similar: computing the input gradient now simply takes the for loop for computing the input gradient from the prior section, computes it for each observation, and stacks the results:

```python
# "_input_grad_1d" is the function containing the for loop from earlier:
# it takes in a 1d input, a 1d filter, and a 1d output_gradient and computes the input grad
grads = [_input_grad_1d(inp[i], param, out_grad[i]) for i in range(batch_size)]
np.stack(grads)
```

The gradient for the filter when dealing with a batch of observations is a bit different. This is because the filter is convolved with every observation in the input and is thus connected to every observation in the output. So, to compute the parameter gradient, we have to loop through all of the observations and increment the appropriate values of the parameter gradient as we do so. Still, this just involves adding an outer for loop to the code to compute the parameter gradient that we saw earlier:

```python
# param: in our case an ndarray of shape (3,)
# param_grad: an ndarray the same shape as param
# inp: in our case an ndarray of shape (2,5)
# input_pad: an ndarray of shape (2,7)
# output_grad: in our case an ndarray of shape (2,5)
for i in range(inp.shape[0]): # inp.shape[0] = 2
    for o in range(inp.shape[1]): # inp.shape[1] = 5
        for p in range(param.shape[0]): # param.shape[0] = 3
            param_grad[p] += input_pad[i][o+p] * output_grad[i][o]
```
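To see the triple loop run end to end, here is a minimal self-contained sketch. It assumes the loss is the sum of all outputs (so `output_grad` is all ones), uses `np.pad` in place of the chapter's padding helper, and takes the batch of two length-7 sequences from above as input.

```python
import numpy as np

inp = np.array([[0, 1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6, 7]])
param = np.array([1, 1, 1])
output_grad = np.ones_like(inp)                # assumes loss = sum of all outputs
pad = param.shape[0] // 2
input_pad = np.pad(inp, ((0, 0), (pad, pad)))  # pad the sequence axis only

# Triple loop from the text: every observation adds to the same param_grad
param_grad = np.zeros_like(param)
for i in range(inp.shape[0]):
    for o in range(inp.shape[1]):
        for p in range(param.shape[0]):
            param_grad[p] += input_pad[i][o + p] * output_grad[i][o]

# Equivalent: one windowed sum over the whole batch per filter element
seq_len = inp.shape[1]
vectorized = np.array(
    [np.sum(input_pad[:, p:p + seq_len] * output_grad) for p in range(param.shape[0])]
)

print(param_grad, vectorized)  # [36 49 48] [36 49 48]
```

Each entry is the sum of the two per-observation gradients, which is exactly what the extra outer loop accumulates.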

Adding this dimension on top of the original 1D convolution was indeed simple; extending this from one- to two-dimensional inputs is similarly straightforward.

### 2D Convolutions
The 2D convolution is a straightforward extension of the 1D case because, fundamentally, the way the input is connected to the output via the filters in each dimension of the 2D case is identical to the 1D case. As a result, the high-level steps on both the forward and backward passes remain the same:
1. On the forward pass, we:
    + Appropriately pad the input.
    + Use the padded input and the parameters to compute the output.

2. On the backward pass, to compute the input gradient we:
    + Appropriately pad the output gradient.
    + Use this padded output gradient, along with the parameters, to compute the input gradient.

3. Also on the backward pass, to compute the parameter gradient we:
    + Appropriately pad the input.
    + Loop through the elements of the padded input and increment the parameter gradient appropriately as we go along.

##### 2D convolutions: coding the forward pass
To make this concrete, recall that for 1D convolutions the code for computing the output given the input and the parameters on the forward pass looked as follows:

```python
# input_pad: a version of the input that has been padded appropriately based on 
# the size of param
out = np.zeros_like(inp) 

for o in range(out.shape[0]):
    for p in range(param_len):
        out[o] += param[p] * input_pad[o+p]
```

For 2D convolutions, we simply modify this to be:

```python
# input_pad: a version of the input that has been padded appropriately based on 
# the size of param
out = np.zeros_like(inp) 

for o_w in range(img_size): # loop through the image width
    for o_h in range(img_size): # loop through the image height
        for p_w in range(param_size): # loop through the parameter width 
            for p_h in range(param_size): # loop through the parameter height
                out[o_w][o_h] += param[p_w][p_h] * input_pad[o_w+p_w][o_h+p_h]
```

You can see that we’ve simply "blown each for loop out" into two for loops.
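A small worked example makes the four loops concrete. This sketch assumes a square $3\times 3$ image and an all-ones $3\times 3$ filter, with `np.pad` standing in for the chapter's 2D padding helper; every output entry is then the sum of the input values under the filter, with zeros beyond the border.

```python
import numpy as np

inp = np.arange(1, 10).reshape(3, 3)      # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
param = np.ones((3, 3), dtype=int)
img_size, param_size = inp.shape[0], param.shape[0]
input_pad = np.pad(inp, param_size // 2)  # zero border of width 1 on every side

out = np.zeros_like(inp)
for o_w in range(img_size):
    for o_h in range(img_size):
        for p_w in range(param_size):
            for p_h in range(param_size):
                out[o_w][o_h] += param[p_w][p_h] * input_pad[o_w + p_w][o_h + p_h]

print(out)
# [[12 21 16]
#  [27 45 33]
#  [24 39 28]]
```

The center value $45$ is the sum of all nine pixels, while the corner value $12 = 1 + 2 + 4 + 5$ only sees the four pixels not covered by padding.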

The extension to two dimensions when we have a batch of images is also similar to the 1D case: just as we did there, we simply add a for loop to the outside of the loops shown here.

##### 2D convolutions: coding the backward pass
Sure enough, just as in the forward pass, we can use the same indexing for the backward pass as in the 1D case. Recall that in the 1D case, the code was:

```python
input_grad = np.zeros_like(inp) 

for o in range(inp.shape[0]):
    for p in range(param_len):
        input_grad[o] += output_pad[o+param_len-p-1] * param[p]
```

In the 2D case, the code is simply:

```python
# output_pad: a version of the output that has been padded appropriately based 
# on the size of param 
input_grad = np.zeros_like(inp)

for i_w in range(img_width):
    for i_h in range(img_height):
        for p_w in range(param_size):
            for p_h in range(param_size):
                input_grad[i_w][i_h]+=output_pad[i_w+param_size-p_w-1][i_h+param_size-p_h-1]*param[p_w][p_h]
```

Note that the indexing on the output is the same as in the 1D case but is simply taking place in two dimensions; in the 1D case, we had:

```python
output_pad[i+param_size-p-1] * param[p]
```

and in the 2D case, we have:

```python
output_pad[i_w+param_size-p_w-1][i_h+param_size-p_h-1] * param[p_w][p_h]
```

The other facts from the 1D case also apply:
+ For a batch of input images, we simply perform the preceding operation for each observation and then stack the results.

+ For the parameter gradient, we have to loop through all the images in the batch and add components from each one to the appropriate places in the parameter gradient:

```python
# input_pad: a version of the input that has been padded appropriately based on
# the size of param

param_grad = np.zeros_like(param)

for i in range(batch_size): # equal to inp.shape[0] 
    for o_w in range(img_size):
        for o_h in range(img_size):
            for p_w in range(param_size):
                for p_h in range(param_size):
                    param_grad[p_w][p_h] += input_pad[i][o_w+p_w][o_h+p_h]*output_grad[i][o_w][o_h]
```
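Index patterns like these are easy to get subtly wrong, so it can help to compare them with finite differences. The following sketch does this for a single image and a single filter; `conv_2d_sketch` and `numeric_grad` are hypothetical helpers written for this check, and the loss is assumed to be the sum of the outputs.

```python
import numpy as np

def conv_2d_sketch(inp, param):
    """Zero-padded 2D convolution of one image with one square filter."""
    size = param.shape[0]
    inp_pad = np.pad(inp, size // 2)
    out = np.zeros(inp.shape)
    for o_w in range(inp.shape[0]):
        for o_h in range(inp.shape[1]):
            out[o_w, o_h] = np.sum(inp_pad[o_w:o_w + size, o_h:o_h + size] * param)
    return out

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences of the scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        up, down = x.copy(), x.copy()
        up[idx] += eps
        down[idx] -= eps
        grad[idx] = (f(up) - f(down)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
inp = rng.normal(size=(4, 4))
param = rng.normal(size=(3, 3))
size = param.shape[0]
output_grad = np.ones_like(inp)              # loss = sum of all outputs
output_pad = np.pad(output_grad, size // 2)
input_pad = np.pad(inp, size // 2)

# Input and parameter gradients with the indexing from the text
input_grad = np.zeros_like(inp)
param_grad = np.zeros_like(param)
for w in range(inp.shape[0]):
    for h in range(inp.shape[1]):
        for p_w in range(size):
            for p_h in range(size):
                input_grad[w, h] += (
                    output_pad[w + size - p_w - 1, h + size - p_h - 1] * param[p_w, p_h]
                )
                param_grad[p_w, p_h] += input_pad[w + p_w, h + p_h] * output_grad[w, h]

input_ok = np.allclose(
    input_grad, numeric_grad(lambda x: conv_2d_sketch(x, param).sum(), inp), atol=1e-5
)
param_ok = np.allclose(
    param_grad, numeric_grad(lambda p: conv_2d_sketch(inp, p).sum(), param), atol=1e-5
)
print(input_ok, param_ok)
```

A batch version would wrap the same loops in one more loop over observations, accumulating into `param_grad` and stacking the per-observation `input_grad` arrays.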

At this point, we’ve almost written the code for the complete multichannel convolution operation; currently, our code convolves filters over a two-dimensional input and produces a two-dimensional output. Of course, as we described earlier, each convolutional layer not only has neurons arranged along these two dimensions but also has some number of "channels" equal to the number of feature maps that the layer creates. 

### The Last Element: Adding “Channels”
How can we modify what we’ve written thus far to account for cases where both the input and the output are multichannel? The answer, as it was when we added batches earlier, is simple: we add two outer for loops to the code we’ve already seen, one loop for the input channels and another for the output channels. By looping through all combinations of the input channel and the output channel, we make each output feature map a combination of all of the input feature maps, as desired.

For this to work, we will have to always represent our images as three-dimensional `ndarrays`, as opposed to the two-dimensional arrays we’ve been using; we’ll represent black-and-white images with one channel and color images with three channels (one for the red values at each location in the image, one for the blue values, and one for the green values). Then, regardless of the number of channels, the operation proceeds as described earlier, with a number of feature maps being created from the image, each of which is a combination of the convolutions resulting from all of the channels in the image (or from the channels in the prior layer, if dealing with layers further on in the network).

##### Forward pass
Given all this, the full code to compute the output for a convolutional layer, given four-dimensional `ndarrays` for the input and the parameters, is:

In [12]:
def _compute_output_obs(obs: np.ndarray, param: np.ndarray) -> np.ndarray:
    '''
    obs: [channels, img_width, img_height]
    param: [in_channels, out_channels, fil_width, fil_height]    
    '''
    assert_dim(obs, 3)
    assert_dim(param, 4)
    
    param_size = param.shape[2]
    param_mid = param_size // 2
    obs_pad = _pad_2d_channel(obs, param_mid)
    
    in_channels = param.shape[0]
    out_channels = param.shape[1]
    img_size = obs.shape[1]
    
    out = np.zeros((out_channels,) + obs.shape[1:])
    for c_in in range(in_channels):
        for c_out in range(out_channels):
            for o_w in range(img_size):
                for o_h in range(img_size):
                    for p_w in range(param_size):
                        for p_h in range(param_size):
                            out[c_out][o_w][o_h] += \
                            param[c_in][c_out][p_w][p_h] * obs_pad[c_in][o_w+p_w][o_h+p_h]
    return out    

def _output(inp: np.ndarray, param: np.ndarray) -> np.ndarray:
    '''
    inp: [batch_size, channels, img_width, img_height]
    param: [in_channels, out_channels, fil_width, fil_height]    
    '''
    outs = [_compute_output_obs(obs, param) for obs in inp]    
    return np.stack(outs)

Note that `_pad_2d_channel` is a function that zero-pads the width and height of each channel of the input; the channel dimension itself keeps its size.

Again, the actual code that does the computation is similar to the code in the simpler 2D case (without channels) shown before, except now we have, for example, `param[c_in][c_out][p_w][p_h]` instead of just `param[p_w][p_h]`, since there are two more dimensions and `c_in × c_out` times as many elements in the filter array.

##### Backward pass
The backward pass is similar and follows the same conceptual principles as the backward pass in the simple 2D case:
1. For the input gradients, we compute the gradients of each observation individually—padding the output gradient to do so—and then stack the gradients.

2. For the parameter gradient, we pad the input rather than the output gradient, and we loop through the observations as well, using the appropriate values from each one to update the parameter gradient.

Here’s the code for computing the input gradient:

In [13]:
def _compute_grads_obs(input_obs:np.ndarray, output_grad_obs:np.ndarray, param:np.ndarray) -> np.ndarray:
    '''
    input_obs: [in_channels, img_width, img_height]
    output_grad_obs: [out_channels, img_width, img_height]
    param: [in_channels, out_channels, fil_width, fil_height]    
    '''
    input_grad = np.zeros_like(input_obs)    
    param_size = param.shape[2]
    param_mid = param_size // 2
    img_size = input_obs.shape[1]
    in_channels = input_obs.shape[0]
    out_channels = param.shape[1]
    output_obs_pad = _pad_2d_channel(output_grad_obs, param_mid)
    
    for c_in in range(in_channels):
        for c_out in range(out_channels):
            for i_w in range(input_obs.shape[1]):
                for i_h in range(input_obs.shape[2]):
                    for p_w in range(param_size):
                        for p_h in range(param_size):
                            input_grad[c_in][i_w][i_h] += \
                            output_obs_pad[c_out][i_w+param_size-p_w-1][i_h+param_size-p_h-1] \
                            * param[c_in][c_out][p_w][p_h]
    return input_grad

def _input_grad(inp:np.ndarray, output_grad:np.ndarray, param:np.ndarray) -> np.ndarray:
    grads = [_compute_grads_obs(inp[i], output_grad[i], param) for i in range(output_grad.shape[0])]    
    return np.stack(grads)

And here’s the parameter gradient:

In [14]:
def _param_grad(inp:np.ndarray, output_grad:np.ndarray, param:np.ndarray) -> np.ndarray:
    '''
    inp: [batch_size, in_channels, img_width, img_height]
    output_grad: [batch_size, out_channels, img_width, img_height]
    param: [in_channels, out_channels, fil_width, fil_height]    
    '''
    param_grad = np.zeros_like(param)    
    param_size = param.shape[2]
    param_mid = param_size // 2
    img_size = inp.shape[2]
    in_channels = inp.shape[1]
    out_channels = output_grad.shape[1]    

    inp_pad = _pad_conv_input(inp, param_mid)
    img_shape = output_grad.shape[2:]

    for i in range(inp.shape[0]):
        for c_in in range(in_channels):
            for c_out in range(out_channels):
                for o_w in range(img_shape[0]):
                    for o_h in range(img_shape[1]):
                        for p_w in range(param_size):
                            for p_h in range(param_size):
                                param_grad[c_in][c_out][p_w][p_h] += \
                                inp_pad[i][c_in][o_w+p_w][o_h+p_h] \
                                * output_grad[i][c_out][o_w][o_h]
    return param_grad
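With seven nested loops, a numerical check is a useful safeguard. The sketch below is a compact restatement rather than the notebook's own functions: `conv_forward` and `conv_param_grad` are hypothetical names, the channel and filter loops are folded into `np.einsum`, and the parameter gradient is compared with finite differences for an arbitrary output gradient.

```python
import numpy as np

def conv_forward(inp, param):
    """inp: [batch, in_channels, w, h]; param: [in_channels, out_channels, size, size]."""
    size = param.shape[2]
    mid = size // 2
    inp_pad = np.pad(inp, ((0, 0), (0, 0), (mid, mid), (mid, mid)))
    batch, _, width, height = inp.shape
    out = np.zeros((batch, param.shape[1], width, height))
    for o_w in range(width):
        for o_h in range(height):
            window = inp_pad[:, :, o_w:o_w + size, o_h:o_h + size]  # [b, c_in, p, q]
            out[:, :, o_w, o_h] = np.einsum("bcpq,cdpq->bd", window, param)
    return out

def conv_param_grad(inp, output_grad, param):
    """Same accumulation as the seven loops, with five of them folded into einsum."""
    size = param.shape[2]
    mid = size // 2
    inp_pad = np.pad(inp, ((0, 0), (0, 0), (mid, mid), (mid, mid)))
    grad = np.zeros_like(param)
    for o_w in range(inp.shape[2]):
        for o_h in range(inp.shape[3]):
            window = inp_pad[:, :, o_w:o_w + size, o_h:o_h + size]
            grad += np.einsum("bcpq,bd->cdpq", window, output_grad[:, :, o_w, o_h])
    return grad

rng = np.random.default_rng(0)
inp = rng.normal(size=(2, 2, 4, 4))          # 2 observations, 2 input channels
param = rng.normal(size=(2, 3, 3, 3))        # 2 input channels, 3 output channels
output_grad = rng.normal(size=(2, 3, 4, 4))  # arbitrary upstream gradient

def loss(p):
    # A scalar whose gradient with respect to the output is exactly output_grad
    return np.sum(conv_forward(inp, p) * output_grad)

analytic = conv_param_grad(inp, output_grad, param)
numeric = np.zeros_like(param)
eps = 1e-6
for idx in np.ndindex(param.shape):
    bumped = param.copy()
    bumped[idx] += eps
    numeric[idx] = (loss(bumped) - loss(param)) / eps

grads_match = np.allclose(analytic, numeric, atol=1e-4)
print(grads_match)
```

The same pattern (define a scalar loss, bump one entry, compare) applies to the input gradient as well.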

These three functions (`_output`, `_input_grad`, and `_param_grad`) are just what we need to create a `Conv2DOperation`, which will ultimately form the core of the `Conv2DLayers` we’ll use in our `CNN`! There are just a few more details to work out before we can use this `Operation` in a working convolutional neural network.


## 5.4 Using This Operation to Train a CNN
We need to implement a few more pieces before we can have a working `CNN` model:
1. We have to implement the `Flatten` operation discussed earlier in the chapter; this is necessary to enable the model to make predictions.

2. We have to incorporate this `Operation` as well as the `Conv2DOperation` into a `Conv2D` Layer.

3. Finally, for any of this to be usable, we have to write a faster version of the `Conv2D` Operation. We’ll outline this here and share the details in "Matrix Chain Rule".

### The Flatten Operation
There’s one other `Operation` we’ll need to complete our convolutional layer: the `Flatten` operation. The output of a convolution operation is a 3D `ndarray` for each observation, of dimension `(channels, img_height, img_width)`.

However, unless we are passing this data into another convolutional layer, we’ll first need to transform it into a `vector` for each observation. Luckily, as described previously, since each of the individual neurons involved encodes whether a particular visual feature is present at that location in the image, we can simply "flatten" this 3D `ndarray` into a 1D `vector` and pass it forward without any problem. The `Flatten` operation shown here does this, accounting for the fact that in convolutional layers, as with any other layer, the first dimension of our `ndarray` is always the batch size:

```python
class Flatten(Operation):
    def __init__(self):
        super().__init__()

    def _output(self) -> ndarray:
        return self.input.reshape(self.input.shape[0], -1)

    def _input_grad(self, output_grad: ndarray) -> ndarray:
        return output_grad.reshape(self.input.shape)
```
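The two reshapes are inverses of each other, which is what makes the backward pass so simple. A minimal sketch with plain NumPy arrays, independent of the `Operation` base class, with shapes chosen arbitrarily for illustration:

```python
import numpy as np

# A batch of 2 observations, each with 3 channels of 4x4 feature maps
conv_output = np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)

# Forward: keep the batch dimension, flatten everything else
flat = conv_output.reshape(conv_output.shape[0], -1)

# Backward: a gradient shaped like `flat` goes back to the input's shape
restored = flat.reshape(conv_output.shape)

print(flat.shape, restored.shape)  # (2, 48) (2, 3, 4, 4)
```

Since no values are combined or reordered within an observation, each gradient entry lands back on the neuron it came from.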

That’s the last `Operation` we’ll need; let’s wrap these `Operations` up in a Layer.

### The Full Conv2D Layer
The full convolutional layer, then, would look something like this:

```python
class Conv2D(Layer):
    def __init__(self, out_channels:int, param_size:int, 
                 activation:Operation=Sigmoid(), flatten:bool=False) -> None: 
        super().__init__() 
        self.out_channels = out_channels
        self.param_size = param_size 
        self.activation = activation 
        self.flatten = flatten

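    def _setup_layer(self, input_: ndarray) -> None:
        self.params = [] 
        # shape [in_channels, out_channels, param_size, param_size],
        # matching the layout used by _output, _input_grad, and _param_grad
        conv_param = np.random.randn(input_.shape[1],
                                     self.out_channels,
                                     self.param_size, 
                                     self.param_size)
        self.params.append(conv_param)
        self.operations = [] 
        self.operations.append(Conv2DOperation(conv_param)) 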
        self.operations.append(self.activation)
        if self.flatten:
            self.operations.append(Flatten())
        return None
```

The `Flatten` operation is optionally added on at the end, depending on whether we want the output of this layer to be passed forward into another convolutional layer or passed into a fully connected layer for predictions.

##### A note on speed, and an alternative implementation
As those of you who are familiar with computational complexity will realize, this code is catastrophically slow: to calculate the parameter gradient, we needed to write seven nested for loops! There’s nothing wrong with doing this, since the purpose of writing the convolution operation from scratch was to solidify our understanding of how `CNN` work. Still, it is possible to write convolutions in a completely different way; instead of breaking down that process like we have in this chapter, we can break it down into the following steps:
1. From the input, extract $\text{image\_height}\times\text{image\_width}\times\text{num\_channels}$ patches of size $\text{filter\_height}\times\text{filter\_width}$.

2. For each of these patches, perform a dot product of the patch with the appropriate filter connecting the input channels to the output channels.

3. Stack and reshape the results of all of these dot products to form the output.

With a bit of cleverness, we can express almost all of the operations described previously in terms of a batch matrix multiplication, implemented using `NumPy`’s `np.matmul` function. The details of how to do this are described in `Appendix A` and are implemented on the book’s website, but suffice it to say that this allows us to write relatively small convolutional neural networks that can train in a reasonable amount of time. This lets us actually run experiments to see how well convolutional neural networks work!
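To give a flavor of the three steps, here is a rough sketch of a forward pass built from patch extraction and `np.matmul`. It is one possible arrangement and not necessarily the one used in the appendix; `conv2d_loops` and `conv2d_matmul` are hypothetical names, and the loop version is included only as a reference to compare against.

```python
import numpy as np

def conv2d_loops(inp, param):
    """Reference: direct loops over observations, channels, and output positions."""
    size = param.shape[2]
    mid = size // 2
    inp_pad = np.pad(inp, ((0, 0), (0, 0), (mid, mid), (mid, mid)))
    batch, in_ch, width, height = inp.shape
    out = np.zeros((batch, param.shape[1], width, height))
    for i in range(batch):
        for c_in in range(in_ch):
            for c_out in range(param.shape[1]):
                for o_w in range(width):
                    for o_h in range(height):
                        window = inp_pad[i, c_in, o_w:o_w + size, o_h:o_h + size]
                        out[i, c_out, o_w, o_h] += np.sum(window * param[c_in, c_out])
    return out

def conv2d_matmul(inp, param):
    """Same output via patch extraction and a single batched matrix multiplication."""
    size = param.shape[2]
    mid = size // 2
    inp_pad = np.pad(inp, ((0, 0), (0, 0), (mid, mid), (mid, mid)))
    batch, in_ch, width, height = inp.shape
    # Step 1: patches[b, c, w, h, p*size + q] = inp_pad[b, c, w + p, h + q]
    patches = np.stack(
        [inp_pad[:, :, p:p + width, q:q + height]
         for p in range(size) for q in range(size)],
        axis=-1,
    )
    patches_2d = patches.transpose(0, 2, 3, 1, 4).reshape(batch, width * height, -1)
    # Step 2: one dot product per patch and output channel, as a batched matmul
    param_2d = param.transpose(0, 2, 3, 1).reshape(in_ch * size * size, -1)
    out = np.matmul(patches_2d, param_2d)  # [batch, width*height, out_channels]
    # Step 3: move channels forward and restore the image layout
    return out.transpose(0, 2, 1).reshape(batch, -1, width, height)

rng = np.random.default_rng(0)
inp = rng.normal(size=(2, 3, 6, 6))    # [batch, in_channels, width, height]
param = rng.normal(size=(3, 4, 3, 3))  # [in_channels, out_channels, size, size]

same_result = np.allclose(conv2d_loops(inp, param), conv2d_matmul(inp, param))
print(same_result)
```

The matmul version trades memory for speed: it materializes every patch, but the arithmetic then runs inside NumPy's compiled routines instead of Python loops.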

### Experiments
Even using the convolution operation defined by reshaping and the matmul functions, it takes about $10$ minutes to train this model for one epoch with just one convolutional layer, so we restrict ourselves to demonstrating a model with just one convolutional layer, with $32$ channels (a number chosen somewhat arbitrarily):

In [15]:
# Download the MNIST dataset
init_mnist("./data")

# Load the MNIST dataset
X_train, y_train, X_test, y_test = load_mnist("./data")

# one-hot encode
num_labels = len(y_train)
train_labels = np.zeros((num_labels, 10))
for i in range(num_labels):
	train_labels[i][y_train[i]] = 1

log.info("trainset has %d samples" % (num_labels))

num_labels = len(y_test)
test_labels = np.zeros((num_labels, 10))
for i in range(num_labels):
	test_labels[i][y_test[i]] = 1

log.info("testset has %d samples" % (num_labels))

# Standardize using the training set mean and standard deviation
X_train, X_test = X_train-np.mean(X_train), X_test-np.mean(X_train)
X_train, X_test = X_train/np.std(X_train), X_test/np.std(X_train)

# Conv
X_train_conv, X_test_conv = X_train.reshape(-1, 1, 28, 28), X_test.reshape(-1, 1, 28, 28)

[smartflow] file `train-images-idx3-ubyte.gz` already exist
[smartflow] file `t10k-images-idx3-ubyte.gz` already exist
[smartflow] file `train-labels-idx1-ubyte.gz` already exist
[smartflow] file `t10k-labels-idx1-ubyte.gz` already exist
[smartflow] download complete.
[smartflow] file `mnist.pkl` already exist
[smartflow] load mnist dataset...
[smartflow] trainset has 60000 samples
[smartflow] testset has 10000 samples


In [None]:
model = NeuralNetwork(
    layers=[Conv2DLayer(out_channels=32, param_size=5, dropout=0.8, 
                   weight_init="glorot", flatten=True, activation=Tanh()), 
            DenseLayer(neurons=10, activation=Linear())], 
    loss = SoftmaxCrossEntropy(), 
    seed=20190402)

trainer = Trainer(model, SGDMomentum(lr = 0.1, momentum=0.9))
trainer.fit(X_train_conv, train_labels, X_test_conv, test_labels,
			epochs = 1,
			eval_every = 1,
			seed=20190402,
			batch_size=60,
			conv_testing=True);

[smartflow] batch#0 loss: 33.21654456232465
[smartflow] batch#10 loss: 6.721844280611935
[smartflow] batch#20 loss: 8.171938533609682
[smartflow] batch#30 loss: 7.790772208381125
[smartflow] batch#40 loss: 2.7634509970577077
[smartflow] batch#50 loss: 4.878111909339151
[smartflow] batch#60 loss: 4.573424441829335
[smartflow] batch#70 loss: 6.217209363956173
[smartflow] batch#80 loss: 4.482195659373201
[smartflow] batch#90 loss: 1.5586921926342188
[smartflow] batch#100 loss: 4.6243814464151844
