# Spec 0.1

In [9]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

This is similar to neural architecture search (NAS) with RL. There, the actions choose hyperparameters; here, the actions choose which weights to move to. Rather than updating weights (as backpropagation does), we move to a set of existing weights (i.e., a detached layer).

The reward is the loss at the end of the chain of attached layers.

Each detached layer preserves the dimensionality of its input, except for the dense layer.

Attaching a dense layer is one of the possible actions.
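To make the dimensionality constraint concrete, here is a minimal sketch. It assumes a 3x3 convolution with `padding=1` as the shape-preserving layer; `ShapePreservingLayer`, `DenseHead`, the pool size, and the fixed `actions` list are all hypothetical names and values chosen for illustration, not part of the spec.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShapePreservingLayer(nn.Module):
    """Hypothetical detached layer: output shape equals input shape."""

    def __init__(self, channels=3):
        super().__init__()
        # kernel_size=3 with padding=1 keeps height and width unchanged
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))


class DenseHead(nn.Module):
    """Hypothetical dense action: the only layer that changes dimensionality."""

    def __init__(self, in_features, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.fc(torch.flatten(x, 1))


torch.manual_seed(0)
pool = [ShapePreservingLayer() for _ in range(4)]  # candidate detached layers
head = DenseHead(3 * 32 * 32)

actions = [2, 0, 3]  # stand-in for the agent's choices
x = torch.randn(4, 3, 32, 32)
h = x
for a in actions:
    h = pool[a](h)  # any ordering works because shapes never change

chain_shape = tuple(h.shape)
logits = head(h)
```

Because every non-dense layer maps `(N, 3, 32, 32)` to `(N, 3, 32, 32)`, the agent can attach layers in any order without a shape mismatch.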

### Notes

What information does the gradient give us? How can we use it?
- Each partial derivative gives the rate of change of the loss with respect to one weight; the gradient is the vector of these slopes. This gives us several interesting pieces of information:
    - the weight
    - the weight's rate of change
    - the weight's importance (in terms of the partial derivative of the loss w.r.t. this weight)

- One connection: if I knew how a weight was changing, I would know what to look for, namely a next layer that already has that weight at the desired value.
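The three pieces of information listed above can be read off a single layer after one backward pass. This is only a sketch: the tiny `nn.Linear` layer, the random data, the learning rate, and the use of gradient magnitude as an importance proxy are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(4, 2)
x = torch.randn(8, 4)
target = torch.randint(0, 2, (8,))

loss = F.cross_entropy(layer(x), target)
loss.backward()

weights = layer.weight.detach()   # the weight
grads = layer.weight.grad         # d(loss)/d(weight), same shape as the weight
importance = grads.abs()          # assumption: |gradient| as a crude importance proxy

# The weight's rate of change under one plain SGD step (lr is an assumed value)
lr = 0.1
new_weights = weights - lr * grads
delta = new_weights - weights
```

Under plain SGD the change in a weight is just the negative gradient scaled by the learning rate, so the second and third items carry the same information unless the change is tracked over several steps.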

When does the network stop?
- One possibility, in the case of a CNN: it stops when it reaches the dense layer (i.e., when the RL agent chooses the dense layer as its action).

What is the reward for the controller?
- The loss at the end of the chain of attached layers.
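A rough sketch of one episode under these two answers follows. Several things here are assumptions rather than decisions made in these notes: the random `policy` stands in for the RL agent, `MAX_STEPS` is a safety cap so that an episode always ends, and the reward is the negated loss on the assumption that the controller maximizes reward.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
random.seed(0)

# Hypothetical pool: shape-preserving conv layers plus one dense "stop" action.
convs = [nn.Conv2d(3, 3, 3, padding=1) for _ in range(3)]
dense = nn.Linear(3 * 8 * 8, 10)
DENSE = len(convs)  # index of the terminating action
MAX_STEPS = 5       # assumed cap on the number of attached layers


def run_episode(x, y, policy):
    """Attach layers until the policy picks DENSE, then score the chain."""
    chosen = []
    h = x
    for _ in range(MAX_STEPS):
        a = policy(len(convs) + 1)
        if a == DENSE:
            break
        chosen.append(a)
        h = F.relu(convs[a](h))
    chosen.append(DENSE)  # the dense layer always closes the chain
    logits = dense(torch.flatten(h, 1))
    loss = F.cross_entropy(logits, y)
    reward = -loss.item()  # assumption: higher reward means lower loss
    return chosen, reward


x = torch.randn(4, 3, 8, 8)
y = torch.randint(0, 10, (4,))
chosen, reward = run_episode(x, y, lambda n: random.randrange(n))
```

If the controller is instead set up to minimize its objective, the loss could be used directly without the sign flip.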

Using the gradient at each layer
- Keep track of information about the weights by index
- This is one benefit of having the same dimension for the layer/weights in each detached layer
- The trend for a weight can indicate what change we want to make
- But maybe this doesn't work, because each change to a weight only makes sense in context (i.e., given the changes to all the other weights; e.g., a certain change to one weight may be unnecessary if we make some other change to a different weight)
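One way to track per-index information is to stack the gradient of a layer over several steps and look at the sign trend at each index. This is a sketch only: the SGD loop, the MSE objective, and the "mean sign" trend measure are assumptions for illustration, and the caveat above about context still applies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Conv2d(3, 3, 3, padding=1)
opt = torch.optim.SGD(layer.parameters(), lr=0.05)
x = torch.randn(8, 3, 8, 8)
target = torch.randn(8, 3, 8, 8)

history = []
for _ in range(5):
    opt.zero_grad()
    loss = F.mse_loss(layer(x), target)
    loss.backward()
    history.append(layer.weight.grad.clone())  # keep gradients by index
    opt.step()

grad_history = torch.stack(history)       # shape: (steps, 3, 3, 3, 3)
trend = grad_history.sign().mean(dim=0)   # per-index value in [-1, 1]
consistent = trend.abs() == 1.0           # indices whose gradient never changed sign
```

Since every detached layer has the same weight shape, a `trend` tensor computed for one layer can be compared index by index against the weights of any candidate layer.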

What is my state space? Is it continuous or discrete?
- If my states are the detached layers, then a state will have the same value regardless of where it sits in the sequence of layers and which layers preceded it. But that information (sequence and composition) is critically important to how the built network performs
- Discrete: Each state is a network (i.e., the network after that action, which was the attachment of some layer)
- Continuous: If each state is a network, how can we approximate networks and use a continuous model?
    - We can represent the network as a vector with something like [number_of_detached_layers, weights]
    - The downside of this approach is that it requires a fixed number of layers (a fixed architecture)
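The vector representation could look roughly like the following. The helper `encode_state`, the cap `MAX_LAYERS`, and the zero padding are hypothetical choices; `parameters_to_vector` is the existing utility in `torch.nn.utils`.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector

MAX_LAYERS = 3  # assumed cap; this is the "set architecture" downside


def encode_state(layers, max_layers=MAX_LAYERS):
    """Encode a non-empty chain as [number_of_detached_layers, weights...]."""
    # All detached layers share one shape, so each contributes the same count
    per_layer = sum(p.numel() for p in layers[0].parameters())
    flat = torch.cat(
        [parameters_to_vector(layer.parameters()).detach() for layer in layers]
    )
    padded = torch.zeros(max_layers * per_layer)
    padded[: flat.numel()] = flat  # unused slots stay zero
    return torch.cat([torch.tensor([float(len(layers))]), padded])


torch.manual_seed(0)
layers = [nn.Conv2d(3, 3, 3, padding=1) for _ in range(2)]
state = encode_state(layers)
```

The vector has a fixed length of `1 + MAX_LAYERS * per_layer`, which is what makes it usable as input to a continuous model and also what ties it to a maximum depth.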

Do the weights of individual detached layers get updated? When do they get updated?
- TODO

How does the value of a state (that is, this detached layer in the network built so far) get updated?
- TODO

Is there an opportunity to use square matrices and their inverses to reverse-engineer a solution? Can we work backwards from some output to the matrices (weights) we need?
- TODO
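As a starting point for this question, the purely linear case can be sketched as follows. It assumes a single layer with no bias and no activation, so that `Y = X @ W`, and a square, invertible input matrix `X`; none of this is established in these notes, and a nonlinearity such as ReLU would break the exact recovery.

```python
import torch

torch.manual_seed(0)
n = 4

# Square input; adding n * I is an assumed trick to keep it well conditioned
X = torch.randn(n, n, dtype=torch.float64) + n * torch.eye(n, dtype=torch.float64)
W_true = torch.randn(n, n, dtype=torch.float64)  # the weights we pretend not to know
Y = X @ W_true                                   # the desired output

# Work backwards: solve X @ W = Y for W (equivalent to W = X^{-1} Y)
W_recovered = torch.linalg.solve(X, Y)
```

`torch.linalg.solve` is used rather than forming the inverse explicitly, which is the usual numerically safer choice.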

In [4]:
class DetachedLayer(nn.Module):
    '''
    Layer that will be chained with other layers by the
    RL agent.
    '''
    def __init__(self, depth=1):
        super().__init__()
        self.depth = depth
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)  # final output has 10 classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # start_dim=1 flattens every dimension except the first,
        # which is the batch dimension
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
