# Deep Learning Frameworks

Deep learning frameworks such as Chainer, PyTorch (whose design was heavily inspired by Chainer's) and TensorFlow
are designed to make writing neural networks easy.
Think back to the tutorial notebook:
there were lots of places where we wrote repetitive code.
We definitely could have benefited from a framework to help us.


## What does a deep learning framework give you?

At its core, a deep learning framework gives you the tools needed to handle models, losses and optimizers. It basically gives you a library of:

1. functions that you can chain together to express your model
1. loss functions that you can compose together to express your final loss function(s)
1. gradient-based optimizers that you can use to optimize your parameters

Not to mention, because gradients are so important, the library should also provide
an automatic differentiation system to produce a gradient function,
or it should implement the gradients of each model function.
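
As a minimal sketch of what an automatic differentiation system buys you, the snippet below uses `jax.grad` to turn a loss function into its gradient function. The function `mse_loss` and the toy data are made up for illustration; they are not taken from the tutorial notebook.

```python
import jax.numpy as jnp
from jax import grad


def mse_loss(w, x, y):
    """Mean squared error of a linear model with weights `w`."""
    return jnp.mean((jnp.dot(x, w) - y) ** 2)


# By default, grad differentiates with respect to the first argument (w).
dmse_loss = grad(mse_loss)

x = jnp.array([[1.0, 2.0], [3.0, 4.0]])
y = jnp.array([1.0, 2.0])
w = jnp.zeros(2)

# The gradient has the same shape as w.
g = dmse_loss(w, x, y)
```

We never wrote the derivative by hand: `dmse_loss` was derived from `mse_loss` automatically.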

Beyond that, it can (and probably should) also provide:

1. convenience functions for instantiating parameters without needing you to manually specify their shapes
1. a syntax for specifying neural network models conveniently
1. tooling to help you speed up the training of neural networks, whether it be by distributed training or some form of compilation

## Reinventing the wheel to learn the wheel

To learn how a deep learning framework is structured, we are going to attempt to make one. 
After all, there is no better way to learn about a wheel than reinventing it.

### Course design choices

In designing this section of the course, I have made a few choices about how to structure it.

Firstly, we are adopting a "functional"-first approach to building a deep learning framework.
The main reason is that I would like you to see how the math functions that we write in NumPy
get translated into neural network layers.
These functions operate on "data", which includes both the parameters of the model and the data we train the model on.

Secondly, and for the same reason, we are avoiding the use of the large frameworks, PyTorch and TensorFlow.
This is not to say that they are useless; much to the contrary, learning them is _extremely_ useful.
But if the goal is to uncover what they provide, we should stay away from them until we learn the fundamentals.

Thirdly, we are going to stick with `jax`, as it is one of the few systems available at the moment
that provides automatic differentiation on the NumPy API.
Being able to express our layers in the NumPy API has the advantage
that we can stick with PyData community idioms.
Conveniently, `jax` also ships experimental libraries that provide a pedagogical view
on what goes into a deep learning library.
In this section, we will mix and match them
to give you a broad starting view of DL libraries,
thus helping you avoid mental lock-in as you go on to the big libraries.

To get us started, we are going to use the neural network example from before to anchor us.

In [None]:
import pandas as pd

X = pd.read_csv('../data/biodeg_X.csv', index_col=0)
y = pd.read_csv('../data/biodeg_y.csv', index_col=0)

## Layers of Functions: The Dense Transformation

Our model structure was expressed as the following:

```python
def nn_model(p, x):
    # "a1" is the activation from layer 1
    a1 = np.tanh(np.dot(x, p['w1']) + p['b1'])
    # "a2" is the activation from layer 2.
    # Explain why we need logistic at the end.
    a2 = logistic(np.dot(a1, p['w2']) + p['b2'])
    return a2
```

Firstly, notice how we keep using the line:

```python
np.dot(x, p["w"]) + p["b"]
```

In neural network land, we call that dot product (plus a bias) a "Dense" transformation; it is the most commonly used linear algebra operation in deep learning.
This can be expressed in the following form:

In [None]:
import jax.numpy as np

def dense(params, x):
    return np.dot(x, params["w"]) + params["b"]

With that, our `nn_model` can be more concisely expressed as:

In [None]:
def logistic(x):
    return 1 / (1 + np.exp(-x))

def nn_model(p, x):
    a1 = np.tanh(dense(p["dense1"], x))
    a2 = logistic(dense(p["dense2"], a1))
    return a2

You should expect that a "Dense" layer be provided in any deep learning library.

Now, if providing a `Dense` layer is all that we expected of deep learning libraries, that would still be a little too easy.

## Parameters

Now that we've figured out how to create a "dense" layer, we also need to instantiate the model parameters.

Note, though, that we can no longer directly access the parameters `w1`, `b1`, `w2` and `b2`.
A bit of indirection and nesting is now necessary to keep the `dense` function general.

Since a `dense` layer requires both `w` and `b`, we can write a function that instantiates random numbers for them,
with correctly specified shapes.

In [None]:
import numpy.random as npr

def dense_params(params, name, input_dim, output_dim):
    params[name] = dict()
    params[name]["w"] = npr.normal(size=(input_dim, output_dim))
    params[name]["b"] = npr.normal(size=(output_dim,))
    return params

params = dict()
params = dense_params(params, "dense1", 41, 20)
params = dense_params(params, "dense2", 20, 1)
params["dense2"]

To test whether the params work, we can simply pass the data and params through the model:

In [None]:
nn_model(params, X.values).shape

And with that, we've greatly reduced the amount of boilerplate we need to write when instantiating parameters!

## Optimization Routine and Updating

If you remember, our training loop generally looked something like the following:

```python
losses = []
for i in tqdmn(range(1000)):
    grad_p = dlogistic_loss(params, X.values, y.values)
    for k, v in params.items():
        params[k] = params[k] - grad_p[k] * 0.0001
    losses.append(logistic_loss(params, X.values, y.values))
```

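This loop, as written, works only when `params` is a flat dictionary of arrays. With `dense_params`, our parameters are now dictionaries nested one level deep, so the update has to reach into each layer's dictionary. The nested structure looks like the following: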

```python
params = {
    "layer1": {
        "param1": np.array...,
        "param2": np.array...,
    },
    "layer2": {
        "param1": np.array...,
        "param2": np.array...,
        "param3": np.array...,
    },
}
```
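
One way to update parameters stored in a nested structure like this is `jax.tree_util.tree_map`, which applies a function to every array in the structure. The sketch below assumes that the gradients have the same nested structure as the parameters (which is what `jax.grad` returns when it differentiates with respect to a nested dictionary). The name `sgd_step` and the stand-in gradients are hypothetical, for illustration only.

```python
import jax.numpy as jnp
from jax.tree_util import tree_map


def sgd_step(params, grads, learning_rate=0.0001):
    """Return new params after one gradient descent step.

    Works for any nesting of dicts, lists and tuples, as long as
    `params` and `grads` share the same structure.
    """
    return tree_map(lambda p, g: p - learning_rate * g, params, grads)


params = {
    "dense1": {"w": jnp.ones((3, 2)), "b": jnp.ones(2)},
    "dense2": {"w": jnp.ones((2, 1)), "b": jnp.ones(1)},
}
# Stand-in gradients (all ones) with the same structure as params.
grads = tree_map(jnp.ones_like, params)

new_params = sgd_step(params, grads, learning_rate=0.5)
```

Because the update no longer mentions layer or parameter names, the same function keeps working if we add layers or nest the dictionaries more deeply.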

## Automating Parameter Shape Construction

Now, parameter shapes are also important to keep in mind. The `dense` layer that we defined above _assumes_ the following shapes for our data and params:

```
# Note: ":" represents the "sample" dimension.
(:, 41) @ (41, 20) + (20,)   => (:, 20)
x       @ p["w"]   + p["b"]  => out
```

Recall that when we initialized our parameters, we had to manually specify the parameter names and their shapes.

```python
params = dict()
params['w1'] = noise((41, 20))
params['b1'] = noise((20,))
params['w2'] = noise((20, 1))
params['b2'] = noise((1,))
```

Doing this manually is extremely tedious! Surely there has to be an easier way to specify them?

If we think carefully about what we had to do here, when we specify a neural network layer (such as `dense`), we need to specify the input and output dimensions, ignoring the "sample" dimension (by convention, the first dimension in our X and y matrices). So in our example above, the input dimension is `41` and the output dimension is `20`.

Moreover, the output dimension of the first `dense` layer (`20`) is also the input dimension of the second layer.
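
The chaining of dimensions described above can be sketched in a few lines: given the input dimension and a list of output dimensions, each layer's output dimension is handed to the next layer as its input dimension. The helper `build_dense_params` is hypothetical, written for illustration in the same style as `dense_params`.

```python
import numpy.random as npr


def build_dense_params(input_dim, output_dims):
    """Create params for a stack of dense layers from their output dims."""
    params = dict()
    for i, output_dim in enumerate(output_dims, start=1):
        params[f"dense{i}"] = {
            "w": npr.normal(size=(input_dim, output_dim)),
            "b": npr.normal(size=(output_dim,)),
        }
        # This layer's output dimension is the next layer's input dimension.
        input_dim = output_dim
    return params


params_sketch = build_dense_params(41, [20, 1])
shapes = {
    name: (p["w"].shape, p["b"].shape) for name, p in params_sketch.items()
}
```

Only the first input dimension (`41`) and the output dimensions (`20` and `1`) had to be written down; every other shape follows from them.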

If we were to design an API (basically a language), we would want to simplify the specification of neural net layers, perhaps using the following API sketch as a basis:

```python
model = (
    Dense(20), Tanh,
    Dense(1), Logistic,
)
```

To do this, we will need a way to automatically generate parameters that are of the correct shape.

## `stax`

What `jax` provides is an experimental module called `stax`, which we can leverage to provide a pedagogical view on how to do this.
In `stax`, every neural network layer is a pair of functions, the `init_fun` and the `apply_fun`. Layers that need settings, such as `Dense` with its output dimension, are built by a constructor function that returns this pair.

- the `init_fun` takes in a random number generator key and the input shape, and returns the output shape of the layer, as well as parameters that are initialized with the correct shape
- the `apply_fun` takes in parameters and data, and returns the output, also correctly shaped.

Let's study the `Dense` implementation in `stax`:

```python
def Dense(out_dim, W_init=glorot_normal(), b_init=normal()):
    """Layer constructor function for a dense (fully-connected) layer."""
    def init_fun(rng, input_shape):
        output_shape = input_shape[:-1] + (out_dim,)
        k1, k2 = random.split(rng)
        W, b = W_init(k1, (input_shape[-1], out_dim)), b_init(k2, (out_dim,))
        return output_shape, (W, b)

    def apply_fun(params, inputs, **kwargs):
        W, b = params
        return np.dot(inputs, W) + b

    return init_fun, apply_fun
```

### `apply_fun`

The `apply_fun` is easy to understand: it looks identical to the `dense` layer that we defined above, except that the parameters are defined as a tuple instead of a dictionary.

### `init_fun`

In the `init_fun`, the output shape is defined in a very general fashion. If we structure a dot product using:

```python
np.dot(inputs, W)
```

then the number of dimensions in `inputs` doesn't matter, as long as its last dimension matches the first dimension of `W`:

- `(n_sample, n_input_dims) @ (n_input_dims, n_output_dims)` will execute correctly.
- `(n_sample, n_nuisance_dims, n_input_dims) @ (n_input_dims, n_output_dims)` will execute correctly.
- `(n_sample, n_nuisance_dims1, ..., n_input_dims) @ (n_input_dims, n_output_dims)` will execute correctly.

As such, the output shape is defined as the input shape with its last dimension replaced by the output dim, that is, `input_shape[:-1] + (out_dim,)`.
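
A quick check of this shape rule in plain NumPy, as a sketch: the leading dimensions `5`, `3` and `7` are arbitrary sizes chosen for illustration, while `41` and `20` match our running example.

```python
import numpy

out_dim = 20
W = numpy.ones((41, out_dim))

input_shapes = [(5, 41), (5, 3, 41), (5, 3, 7, 41)]
result_shapes = []
for input_shape in input_shapes:
    inputs = numpy.ones(input_shape)
    # Only the last dimension of the inputs is consumed by the dot product.
    result = numpy.dot(inputs, W)
    assert result.shape == input_shape[:-1] + (out_dim,)
    result_shapes.append(result.shape)
```

Every leading dimension is carried through untouched, which is why `init_fun` can compute the output shape without knowing how many dimensions the input has.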

`W_init` and `b_init` are nothing more than random number generators. For now, only the shapes matter:

- `W` gets shape `(input_shape[-1], out_dim)`
- `b` gets shape `(out_dim,)`

For now, let's ignore what `glorot_normal()` and `normal()` exactly are; it is sufficient for us to call them random number generators.

### Using `stax.Dense`

Let's put `stax.Dense` to use so that you can see how it works.

In [None]:
# Note: in older JAX releases this module lived at jax.experimental.stax.
from jax.example_libraries.stax import Dense
from jax.random import PRNGKey

k = PRNGKey(42)  # you can put any arbitrary number there

init_fun, apply_fun = Dense(20)
output_shape, params = init_fun(k, input_shape=(-1, 41))
params[0].shape, params[1].shape

In [None]:
apply_fun(params, X.values).shape

## Chaining layers with `stax.serial`

And now we can start chaining the Dense layers together by using `stax.serial`. 
`stax.serial` does a few nice magical things under the hood, and it all stems from the implementation:

```python
def serial(*layers):
    """Combinator for composing layers in serial.

    Args:
    *layers: a sequence of layers, each an (init_fun, apply_fun) pair.

    Returns:
    A new layer, meaning an (init_fun, apply_fun) pair, representing the serial
    composition of the given sequence of layers.
    """
    nlayers = len(layers)
    init_funs, apply_funs = zip(*layers)

    def init_fun(rng, input_shape):
        params = []
        for init_fun in init_funs:
            rng, layer_rng = random.split(rng)
            
            # THE MONEY LINE IS HERE! LOOK HERE!
            input_shape, param = init_fun(layer_rng, input_shape)
            params.append(param)
        return input_shape, params

    def apply_fun(params, inputs, **kwargs):
        rng = kwargs.pop('rng', None)
        rngs = random.split(rng, nlayers) if rng is not None else (None,) * nlayers
        for fun, param, rng in zip(apply_funs, params, rngs):
            inputs = fun(param, inputs, rng=rng, **kwargs)
        return inputs

    return init_fun, apply_fun
```

Here, the key thing to notice is the loop in `init_fun`:

```python
for init_fun in init_funs:
    rng, layer_rng = random.split(rng)
    input_shape, param = init_fun(layer_rng, input_shape)
```

Through this, the `output_shape` of one layer becomes the `input_shape` of the next,
thus allowing the parameters to be specified correctly
without needing to explicitly and manually specify shapes.
(We must admit, after all, that it's simply frustrating
to manually keep track of them all!)
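
The `random.split` call in that loop deserves a note. In `jax`, random functions are deterministic given a key, so reusing one key for every layer would give every layer the same random draws. The sketch below illustrates this with made-up shapes and an arbitrary seed.

```python
import jax.numpy as jnp
from jax import random

key = random.PRNGKey(0)

# The same key always yields the same numbers.
a = random.normal(key, (3,))
b = random.normal(key, (3,))

# Splitting yields fresh keys, so each layer can draw its own numbers.
key, layer_key1 = random.split(key)
key, layer_key2 = random.split(key)
w1 = random.normal(layer_key1, (3,))
w2 = random.normal(layer_key2, (3,))
```

Here `a` and `b` are identical, while `w1` and `w2` differ, which is the behaviour we want when initializing separate layers.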

### Using `stax.serial`

Let's put it to use to make sure we can see how it works.

In [None]:
# Note: in older JAX releases this module lived at jax.experimental.stax.
from jax.example_libraries.stax import Dense, Tanh, Elu, serial
init_fun, apply_fun = serial(
    Dense(20),
    Elu,
    Dense(1),
    Tanh
)

output_shape, params = init_fun(k, input_shape=(-1, 41))
output_shape

In [None]:
params[0][0].shape, params[0][1].shape

In [None]:
params[2][0].shape, params[2][1].shape

In [None]:
apply_fun(params, X.values).shape
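
The cells above index `params[0]` and `params[2]` because the activation layers at positions 1 and 3 hold no parameters. A minimal sketch of how a parameter-free layer can fit the same `(init_fun, apply_fun)` protocol follows; `elementwise_layer` is a hypothetical helper written for illustration in plain NumPy, not a copy of the `stax` source.

```python
import numpy


def elementwise_layer(fun):
    """Wrap an elementwise function as an (init_fun, apply_fun) pair."""

    def init_fun(rng, input_shape):
        # The shape passes through unchanged and there is nothing to initialize.
        return input_shape, ()

    def apply_fun(params, inputs, **kwargs):
        # params is an empty tuple; only the inputs are used.
        return fun(inputs)

    return init_fun, apply_fun


tanh_init, tanh_apply = elementwise_layer(numpy.tanh)
out_shape, tanh_params = tanh_init(None, (-1, 20))
out = tanh_apply(tanh_params, numpy.zeros((4, 20)))
```

Because such a layer returns its input shape as its output shape, it can sit between two `Dense` layers without disturbing the shape bookkeeping.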

As you can see, by leveraging **compositionality**, we can use simple syntax to chain layers together without needing to manually specify input/output shapes. 