## Pytorch NN Base
The basic workflow is:
* Define a network (learnable parameters or weights)
* *Iterate* over a dataset of inputs
* Run the inputs forward through the network (with some initial weights).
* Compute the loss.
* Backpropagate the loss to get gradients for the parameters.
* Update weights (using *some rule* and *learning rate*).

But why do we need the package? We can do all of this using just numpy, yeah? And that is exactly what [this tutorial](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html) starts with: building a network using just numpy. But numpy is for generic scientific computing. It has no 'understanding' of computational graphs, or learning, or gradients.

Pytorch is a package that provides this 'understanding' or abstraction. 
* provides tensors as fundamental objects and keeps track of computations on them.
* provides GPU support (numpy does not). GPU speedups can be an order of magnitude or more, depending on the workload.
* provides automatic differentiation for building and training NNs.

Warning: this will probably make more sense later, but pytorch modules expect data in 'batches'. If sending just a single sample, add a fake batch dimension (e.g. with `unsqueeze(0)`) so that it looks like a batch of one. Think of $N$ as the batch size (for the input) and $D$ as the dimensionality of each input.
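A minimal sketch of the batching point, assuming a single sample stored as a 1-D tensor. The sizes and the `layer` object are illustrative, not part of the tutorial.

```python
import torch

D = 5                      # dimensionality of one input (illustrative value)
sample = torch.randn(D)    # a single sample, shape (D,)

# Add a leading batch dimension so the tensor has shape (N, D) with N = 1.
batch = sample.unsqueeze(0)

layer = torch.nn.Linear(D, 3)   # hypothetical layer mapping D inputs to 3 outputs
out = layer(batch)

print(batch.shape)  # torch.Size([1, 5])
print(out.shape)    # torch.Size([1, 3])
```

The output keeps the batch dimension, so `out[0]` recovers the prediction for the single sample.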

In ML, and DL, dimensions can be very large, especially for inputs (think image or text data).

### Defining networks
The example starts with numpy to fit a two-layer network. Layer 1 connects the input to the hidden layer, and layer 2 connects the hidden layer to the output. The output of the hidden layer is rectified using a simple $\max(0,x)$ ReLU.
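For reference, the hand-written gradients in the numpy code follow from the chain rule. This is a sketch in matrix notation; the symbols are introduced here for illustration: $W_1, W_2$ stand for `w1`, `w2`, $r$ for `h_relu`, $\hat{y}$ for `y_pred`, $L$ for `loss`, and $\odot$ is the elementwise product.

$$
\begin{aligned}
h &= x W_1, \qquad r = \max(h, 0), \qquad \hat{y} = r W_2, \qquad L = \sum_{i,j} (\hat{y}_{ij} - y_{ij})^2 \\
\frac{\partial L}{\partial \hat{y}} &= 2(\hat{y} - y) \\
\frac{\partial L}{\partial W_2} &= r^{\top} \frac{\partial L}{\partial \hat{y}} \\
\frac{\partial L}{\partial r} &= \frac{\partial L}{\partial \hat{y}} W_2^{\top} \\
\frac{\partial L}{\partial h} &= \frac{\partial L}{\partial r} \odot \mathbb{1}[h \ge 0] \\
\frac{\partial L}{\partial W_1} &= x^{\top} \frac{\partial L}{\partial h}
\end{aligned}
$$

The indicator $\mathbb{1}[h \ge 0]$ matches the masking step `grad_h[h<0] = 0`; at $h = 0$ the ReLU is not differentiable, so passing the gradient through there is a convention.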

In [1]:
import numpy as np

#N is batch size (for data) and D_in is the dimension for input.
#H is number of hidden dimensions, D_out is output dimension, e.g. 10 digits [0-9], for numbers image data

N, D_in, H, D_out = 64, 1000, 100, 10 

#create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# initial weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

#learning rate (fixed hyperparameter)
learning_rate = 1e-6

#now the network model
for t in range(500): #500 iterations (use far fewer with real data, or it might take too long)
    #forward pass
    h = x.dot(w1) 
    h_relu = np.maximum(h,0) #very simple rectifier unit
    y_pred = h_relu.dot(w2)
        
    #compute and print the loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    #Backprop to compute gradients of w1 and w2 using loss as objective
    grad_y_pred = 2.0*(y_pred -y) #derivative of square, note 2.0 for float calculations
    grad_w2 = h_relu.T.dot(grad_y_pred) #transpose then dot product
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy() #copy so that the in-place masking below does not modify grad_h_relu
    grad_h[h<0] = 0 #ReLU gradient is 0 where h<0 and 1 (identity) elsewhere; max is not differentiable at 0
    grad_w1 = x.T.dot(grad_h)
    
    #Now update the weights using the gradients for w2 and w1.      
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2 

0 33477673.68643208
1 30163472.302399434
2 30025893.258036844
3 27972180.24679287
4 22436725.352094945
5 14991097.401314098
6 8768192.023958005
7 4841633.966364619
8 2774096.7126852125
9 1739555.99338177
10 1209946.3467236017
11 914202.0692721449
12 730333.2779482367
13 603400.4122178403
14 508587.817847029
15 434083.05234760814
16 373596.5583529949
17 323561.6016558534
18 281743.97556710034
19 246432.49015176835
20 216440.86522264237
21 190770.3797779755
22 168717.04928244068
23 149655.56266308087
24 133112.70239140943
25 118707.26708087689
26 106122.73058705927
27 95083.47785528918
28 85375.56885176737
29 76816.11342870211
30 69243.23007321885
31 62524.17893362534
32 56552.17007547518
33 51234.888898544494
34 46489.243314375024
35 42243.490731478414
36 38440.896908627896
37 35027.6036973638
38 31958.012224317776
39 29193.40987652194
40 26701.098515020276
41 24448.70628022482
42 22410.481514151365
43 20563.03213570833
44 18887.783058022505
45 17365.106217276563
46 15979.624183769389
...

359 0.0016755172490401688
360 0.0016011739751992451
361 0.001530173788556661
362 0.0014623129772910904
363 0.0013974530228470325
364 0.001335475163387195
365 0.0012762502125164125
366 0.0012196887400819925
367 0.0011656169884310592
368 0.001113947544976227
369 0.0010645990345712392
370 0.0010174127105160916
371 0.0009723268063867302
372 0.000929244349613985
373 0.0008880789987439038
374 0.0008487351373741726
375 0.0008111428864412682
376 0.0007752445552687547
377 0.0007409240027889893
378 0.0007081208853051261
379 0.0006767713395224513
380 0.0006468081713395025
381 0.0006181783229905111
382 0.0005908167078099657
383 0.000564668611586825
384 0.000539692431386522
385 0.0005158263585015113
386 0.0004930190914183251
387 0.00047120996653256773
388 0.00045036746908781993
389 0.0004304477361971123
390 0.00041141397792666896
391 0.00039321977353424034
392 0.0003758399669501524
393 0.0003592250861500034
394 0.0003433460699179226
395 0.00032817038977866315
396 0.0003136724947662733
397 0.0002998

## Moving to pytorch
Pytorch *tensors* are like numpy arrays, but they can also keep track of computations on them. Tracking of the computation history is optional (it is off by default), so tensors, together with the helper functions that come with torch, can serve as data structures for general scientific computing in place of numpy. The focus here is on applications in DL.
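A small sketch of using tensors as a numpy substitute, with no gradient tracking involved. The array values are made up for illustration.

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)

# torch.from_numpy shares memory with the NumPy array (no copy is made).
t = torch.from_numpy(a)

# Ordinary array-style computation; no gradient history is recorded
# because requires_grad defaults to False.
col_sums = t.sum(dim=0)

# Convert back to NumPy for use with other scientific code.
back = col_sums.numpy()

print(t.requires_grad)  # False
print(back)             # [3. 5. 7.]
```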

To run pytorch tensors on **GPUs**, we simply place them on a **CUDA device**, either by passing `device=` when creating them or by calling `.to(device)`.
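A hedged sketch of device placement that also runs on machines without a GPU. The fallback to the CPU is an assumption added here for portability, and the shapes are illustrative.

```python
import torch

# Fall back to the CPU when no CUDA device is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 3)                   # created on the CPU by default
x_dev = x.to(device)                    # move to the chosen device (no-op if already there)
w = torch.randn(3, 2, device=device)    # or create directly on the device

y = x_dev.mm(w)                         # both operands must live on the same device
print(y.device.type, y.shape)
```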

To start with, we implement the above network using tensors instead of numpy arrays. 

In [19]:
import torch

dtype = torch.float
device = torch.device("cpu") #change to "cuda:0" to run on GPU

#N is batch size (for data) and D_in is the dimension for input.
#H is number of hidden dimensions, D_out is output dimension

N, D_in, H, D_out = 64, 1000, 100, 10 

#create random input and output data (requires_grad=False is default). No need to keep track (i.e. no gradients) for
# these tensors
x = torch.randn(N, D_in, dtype=dtype, device=device)
y = torch.randn(N, D_out,dtype=dtype, device=device)

# initial weights. Gradients are computed by hand below, so autograd tracking is not needed here
w1 = torch.randn(D_in, H, dtype=dtype, device=device)
w2 = torch.randn(H, D_out, dtype=dtype, device=device)

learning_rate = 1e-6

for t in range(500):
    #Forward pass
    h = x.mm(w1) #mm is matrix mult.
    h_relu = h.clamp(min=0) ##clamp can be used to filter based on min, max
    y_pred = h_relu.mm(w2)
    
    #compute and print loss
    loss = (y_pred - y).pow(2).sum().item()  #extract scalar tensor (1x1) as python number
    print(t, loss)
    
    #Backprop gradients
    grad_y_pred = 2.0*(y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h<0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    #print('grad_w1 size \n', grad_w1.size())
    #print('w1 size \n', w1.size())

    #update weights using simple grad descent
    w1 = w1 - learning_rate*grad_w1  #rebinds w1; an in-place -= raises an error if w1 is a leaf tensor tracked by autograd
    w2 = w2 - learning_rate*grad_w2

0 32299612.0
1 35925028.0
2 49022744.0
3 61869480.0
4 57583924.0
5 33232156.0
6 11806640.0
7 3674725.5
8 1637399.625
9 1085680.25
10 853633.5625
11 706969.75
12 596168.6875
13 507610.03125
14 435396.4375
15 375783.4375
16 326120.84375
17 284483.40625
18 249301.15625
19 219362.203125
20 193733.96875
21 171709.765625
22 152678.203125
23 136179.046875
24 121805.9375
25 109234.546875
26 98186.2578125
27 88454.4921875
28 79851.0078125
29 72225.390625
30 65476.87890625
31 59472.51171875
32 54107.10546875
33 49304.8359375
34 44992.41015625
35 41111.60546875
36 37614.71484375
37 34458.73828125
38 31604.517578125
39 29020.0546875
40 26675.3125
41 24545.546875
42 22608.013671875
43 20842.935546875
44 19233.533203125
45 17763.560546875
46 16419.09765625
47 15188.6904296875
48 14061.4912109375
49 13027.3818359375
50 12078.0634765625
51 11205.16796875
52 10403.890625
53 9667.1416015625
54 8987.90625
55 8361.7529296875
56 7784.4716796875
57 7250.6826171875
58 6757.2626953125
59 6300.646484375
...

380 0.0013538445346057415
381 0.001311617437750101
382 0.0012705361004918814
383 0.001230783062055707
384 0.00119133316911757
385 0.001154109719209373
386 0.0011174723040312529
387 0.0010820638854056597
388 0.001050214865244925
389 0.0010179555974900723
390 0.0009867447661235929
391 0.0009562592604197562
392 0.0009285410051234066
393 0.0008998559205792844
394 0.0008763383375480771
395 0.0008495483198203146
396 0.0008246278739534318
397 0.0008010355522856116
398 0.0007785066845826805
399 0.0007550838054157794
400 0.0007353410474024713
401 0.0007136947242543101
402 0.0006936751306056976
403 0.0006751532200723886
404 0.0006548620876856148
405 0.0006376980454660952
406 0.0006205749232321978
407 0.0006031302618794143
408 0.0005877164076082408
409 0.0005723702488467097
410 0.0005583013407886028
411 0.0005447372677735984
412 0.0005293367430567741
413 0.0005154319223947823
414 0.0005035390495322645
415 0.000489799480419606
416 0.000476627959869802
417 0.0004640970437321812
418 0.00045409167069

### Using autograd
Instead of manual computation of derivatives, we can use autograd. When using autograd, the *forward pass* methods define a **computational graph** where the tensors are nodes and functions are edges connecting tensors (via transformations). By keeping track of these, it is easier to compute gradients by backpropagating over the graph.

Note: The graph will generally not be a tree, given (1) multidimensional output and (2) the number of hidden layers (and nodes therein). Think in terms of general graphs, not trees.
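A tiny sketch of autograd on a function whose derivative is easy to check by hand. The tensor values are arbitrary.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: each operation is recorded in the graph.
y = (x ** 2).sum()

# The result remembers the function that produced it.
print(y.grad_fn is not None)  # True

# Backward pass: populates x.grad with dy/dx = 2 * x.
y.backward()
print(x.grad)  # tensor([2., 4., 6.])
```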

Implement the same network as before, but now using autograd.

In [21]:
import torch

dtype = torch.float
device = torch.device("cpu")

#N is batch size (for data) and D_in is the dimension for input.
#H is number of hidden dimensions, D_out is output dimension

N, D_in, H, D_out = 64, 1000, 100, 10 

#create random input and output data (requires_grad=False is default). No need to keep track (i.e. no gradients) for
# these tensors
x = torch.randn(N, D_in, dtype=dtype, device=device)
y = torch.randn(N, D_out,dtype=dtype, device=device)

# initial weights. We do need gradients for these
w1 = torch.randn(D_in, H, dtype=dtype, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=dtype, device=device, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    #what changes in the forward pass now is that we no longer have to keep track of intermediate values
    #autograd will do that for us.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    #loss: keep loss as a tensor, for printing extract scalar using .item()
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    #Use autograd for backprop. 
    loss.backward() #compute grad of loss w.r.t. all tensors for which requires_grad=True (w1 and w2 here)
    #No need to define grad_w1, grad_w2. These will be computed as objects w1.grad and w2.grad
    
    #Update weights (manually)
    with torch.no_grad(): #no need to keep track of the following computations, no gradients needed
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        #same as calling torch.optim.SGD, which is the way to do it especially with lots of layers
        
        #Imp: zero out the gradients for the next iteration.
        w1.grad.zero_()
        w2.grad.zero_()

0 34415588.0
1 33386760.0
2 35780908.0
3 35138592.0
4 28675630.0
5 18320424.0
6 9833536.0
7 4948503.5
8 2697718.25
9 1694921.5
10 1216936.625
11 952948.0
12 783018.5
13 659800.125
14 563739.875
15 485914.3125
16 421567.96875
17 367619.65625
18 321902.71875
19 282932.75
20 249529.859375
21 220759.328125
22 195870.25
23 174241.96875
24 155373.84375
25 138850.40625
26 124347.234375
27 111619.03125
28 100390.6015625
29 90460.1328125
30 81656.984375
31 73822.890625
32 66842.09375
33 60624.8125
34 55057.94921875
35 50065.01171875
36 45575.78125
37 41536.30859375
38 37892.47265625
39 34601.05078125
40 31626.505859375
41 28932.486328125
42 26498.1328125
43 24283.875
44 22272.310546875
45 20441.794921875
46 18776.1640625
47 17256.494140625
48 15870.2197265625
49 14604.314453125
50 13447.177734375
51 12388.849609375
52 11419.7900390625
53 10531.51171875
54 9717.6640625
55 8971.185546875
56 8286.2763671875
57 7658.78955078125
58 7081.4921875
59 6550.2900390625
60 6061.23046875
61 5610.70263671875

382 0.00016171483730431646
383 0.0001582498662173748
384 0.00015422333672177047
385 0.00014980621926952153
386 0.00014696385187562555
387 0.00014325529627967626
388 0.00013921360368840396
389 0.00013566140842158347
390 0.00013289703929331154
391 0.00013020589540246874
392 0.00012740895908791572
393 0.00012420416169334203
394 0.00012136372970417142
395 0.00011900091340066865
396 0.00011613456445047632
397 0.00011347702820785344
398 0.00011090640327893198
399 0.00010910756827797741
400 0.00010623827984090894
401 0.0001035967143252492
402 0.0001013209403026849
403 9.882591257337481e-05
404 9.746110299602151e-05
405 9.530911484034732e-05
406 9.312695328844711e-05
407 9.063066681846976e-05
408 8.9504464995116e-05
409 8.794963650871068e-05
410 8.573201921535656e-05
411 8.443597471341491e-05
412 8.270327089121565e-05
413 8.094950317172334e-05
414 7.973020547069609e-05
415 7.780196028761566e-05
416 7.638037641299888e-05
417 7.516311598010361e-05
418 7.378451118711382e-05
...

### A deeper look at autograd
The autograd operators are a combination of **forward** and **backward** functions. The forward function computes output tensors from input tensors. That's straightforward.

The backward function *receives the gradient* (partial derivative) of "some scalar" (typically the loss) w.r.t. the output tensors of the forward function and *computes the gradient* (partial derivative) of the same scalar w.r.t. the input tensors.

To understand this better, let us define a custom autograd function for the ReLU non-linearity and reimplement the above network. **Custom autograd** functions can be created by defining a subclass of *torch.autograd.Function*.
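Before the ReLU version, here is a sketch of the same pattern on a function with a smooth derivative, so the custom backward can be compared against the built-in operators. `Cube` is a hypothetical example class, not part of the tutorial.

```python
import torch

class Cube(torch.autograd.Function):
    """Hypothetical example function: y = x ** 3."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # keep x for the backward pass
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        # Chain rule: dL/dx = dL/dy * dy/dx, with dy/dx = 3 * x ** 2.
        return grad_output * 3 * x ** 2

x = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)

# Gradient through the custom function.
Cube.apply(x).sum().backward()
custom_grad = x.grad.clone()

# Gradient through the built-in operators, for comparison.
x.grad.zero_()
(x ** 3).sum().backward()

print(torch.allclose(custom_grad, x.grad))  # True
```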

In [3]:
import torch

class MyReLU(torch.autograd.Function): #a subclass that inherits from class autograd.Function
    """
    implement our own forward and backward functions
    """
    
    @staticmethod
    def forward(ctx, input):
        """
        Forward pass takes tensor 'input' and returns an output tensor. 'ctx' is a context object used to stash
        information for the backward pass. Tensors can be cached for use during the backward computation
        using the "ctx.save_for_backward" method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        Receives a tensor containing the gradient of the loss w.r.t. the output. Needs to compute the gradient
        of the loss w.r.t. the input.
        """
        input, = ctx.saved_tensors #saved_tensors is a tuple; the trailing ',' unpacks its single element
        grad_input = grad_output.clone()
        grad_input[input<0]=0
        return grad_input

dtype = torch.float
device = torch.device("cpu")

#N is batch size (for data) and D_in is the dimension for input.
#H is number of hidden dimensions, D_out is output dimension

N, D_in, H, D_out = 64, 1000, 100, 10 

#create random input and output data (requires_grad=False is default). No need to keep track (i.e. no gradients) for
# these tensors
x = torch.randn(N, D_in, dtype=dtype, device=device)
y = torch.randn(N, D_out,dtype=dtype, device=device)

# initial weights. We do need gradients for these
w1 = torch.randn(D_in, H, dtype=dtype, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=dtype, device=device, requires_grad=True)

learning_rate = 1e-6

for t in range(500):
    #use method Function.apply for our custom function (apply to the subclass we created).
    relu = MyReLU.apply #note: not apply(), just '.apply'

    y_pred = relu(x.mm(w1)).mm(w2)
    
    #No other changes from previous example
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
   
    loss.backward()    
    
    with torch.no_grad(): 
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
       
        w1.grad.zero_()
        w2.grad.zero_()

0 29722048.0
1 24169176.0
2 22939192.0
3 22158842.0
4 20072224.0
5 16135012.0
6 11559114.0
7 7497387.5
8 4656302.0
9 2886603.0
10 1863061.625
11 1273845.125
12 927848.25
13 713348.375
14 572011.4375
15 472620.0625
16 398679.65625
17 341124.375
18 294791.09375
19 256617.453125
20 224651.09375
21 197583.96875
22 174441.703125
23 154520.03125
24 137278.6875
25 122301.7890625
26 109251.375
27 97813.5625
28 87756.21875
29 78887.15625
30 71041.8828125
31 64085.0234375
32 57913.078125
33 52414.671875
34 47504.14453125
35 43114.48828125
36 39180.1015625
37 35648.3203125
38 32473.744140625
39 29615.662109375
40 27039.326171875
41 24710.849609375
42 22604.9140625
43 20698.7578125
44 18970.501953125
45 17400.666015625
46 15979.123046875
47 14684.9619140625
48 13505.0283203125
49 12429.1572265625
50 11447.3818359375
51 10549.712890625
52 9728.517578125
53 8976.86328125
54 8288.2236328125
55 7657.291015625
56 7078.71240234375
57 6547.5126953125
58 6059.5927734375
59 5611.06591796875
60 5198.4257812

388 0.0009739379165694118
389 0.0009474969701841474
390 0.000920682679861784
391 0.0008939544204622507
392 0.0008693021954968572
393 0.0008449486922472715
394 0.0008223360637202859
395 0.0007995872874744236
396 0.0007783115142956376
397 0.0007585431449115276
398 0.0007371088722720742
399 0.000715818430762738
400 0.0006978127057664096
401 0.0006800090195611119
402 0.0006609293632209301
403 0.0006439045537263155
404 0.0006267909775488079
405 0.0006102366605773568
406 0.0005947838071733713
407 0.0005784579552710056
408 0.0005653571570292115
409 0.0005503345746546984
410 0.0005361336516216397
411 0.0005228366353549063
412 0.0005098735564388335
413 0.0004975214251317084
414 0.0004858451138716191
415 0.0004733560199383646
416 0.0004619699320755899
417 0.00045103218872100115
418 0.00043963969801552594
419 0.00043033057590946555
420 0.00042001192923635244
421 0.00040935975266620517
422 0.00040004050242714584
423 0.00039148301584646106
424 0.00038214094820432365
425 0.00037365700700320303
...

### Comparison with TensorFlow
Both frameworks define computational graphs and use automatic differentiation to compute gradients. The biggest difference is that TensorFlow graphs (at least in the original graph mode of TensorFlow 1.x) are 'static': the computational graph is defined once, and the same graph is executed over and over, possibly with different input data. This is good because the entire graph can be optimized up front, and the cost of that optimization is spread over the many executions that reuse the same structure.

In pytorch, each forward pass builds a new computational graph. This is useful when certain types of inputs need to be treated differently. That's the sense in which the graph is **dynamic**: it can be made **conditional on the input** (via loops and branches). In static-graph TensorFlow, such loops have to be made part of the graph from the start (e.g. unrolled or expressed with special graph operators).

In pytorch, the computational graph is created on an as-needed basis. So flow control is the same as in standard python.
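A short sketch of input-dependent control flow. The function `f` and its two branches are made up for illustration; the point is only that the recorded graph, and hence the gradient, depends on which branch actually runs.

```python
import torch

def f(x):
    # Ordinary Python branching; the graph is built from whichever path runs.
    if x.sum() > 0:
        return (2 * x).sum()      # gradient is 2 everywhere
    return (x ** 2).sum()         # gradient is 2 * x

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([-1.0, -2.0], requires_grad=True)

f(a).backward()
f(b).backward()

print(a.grad)  # tensor([2., 2.])
print(b.grad)  # tensor([-2., -4.])
```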

### The nn module
Computational graphs and autograd are too low level for most ordinary neural network implementations. The most common abstraction is to think about a network as a collection of **layers**, some or all of which have learnable parameters (weights) that will be optimized. **Keras** and **TFLearn** provide these high-level abstractions for TensorFlow. In pytorch, this is provided by the **nn package**, which defines a *set of modules* that are somewhat like layers in a neural net. It also defines common loss functions; optimization algorithms live in the separate optim package.
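To see what "modules with learnable parameters" means in practice, here is a sketch that lists the parameters of a small model. The layer sizes are illustrative and much smaller than in the tutorial.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4, 3),   # weight (3, 4), bias (3,)
    torch.nn.ReLU(),         # no learnable parameters
    torch.nn.Linear(3, 2),   # weight (2, 3), bias (2,)
)

# named_parameters() yields every learnable tensor with its name.
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 4*3 + 3 + 3*2 + 2 = 23
```

Note that `Linear` stores its weight as `(out_features, in_features)`, the transpose of the `w1` and `w2` layout used in the hand-written examples.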

Re-implement our example using the nn package.

In [30]:
import torch

#N is batch size (for data) and D_in is the dimension for input.
#H is number of hidden dimensions, D_out is output dimension

N, D_in, H, D_out = 64, 1000, 100, 10 

#create random input and output data (requires_grad=False is default). No need to keep track (i.e. no gradients) for
# these tensors
x = torch.randn(N, D_in, dtype=dtype, device=device,requires_grad=False)
y = torch.randn(N, D_out,dtype=dtype, device=device,requires_grad=False)

model = torch.nn.Sequential(  #sequence of layers
            torch.nn.Linear(D_in, H), #first layer. Each linear module holds internal tensors for weights and bias.
            torch.nn.ReLU(),
            torch.nn.Linear(H,D_out), #second layer
)

loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate =1e-4

for t in range(500):
    #forward pass: pass x to the model (modules can be called as functions, with Tensor inputs and output)
    y_pred = model(x)
    
    loss = loss_fn(y_pred,y)
    print(t, loss.item())
    
    model.zero_grad() #zero out the gradients before running the backward pass
    
    loss.backward() #all modules store parameters with requires_grad=True; this computes gradients for
                    #all learnable parameters.
    #update the parameters: can be automated via optimizer.step() (see the optim example; needs an optimizer object first)
    with torch.no_grad(): 
        for param in model.parameters():#
            param -= learning_rate * param.grad

0 730.82421875
1 674.350341796875
2 626.1851196289062
3 584.3934326171875
4 547.1656494140625
5 513.7194213867188
6 483.4883117675781
7 455.9836730957031
8 430.5881652832031
9 406.9766540527344
10 385.00506591796875
11 364.4584655761719
12 344.910400390625
13 326.4380187988281
14 308.9134826660156
15 292.293212890625
16 276.5090026855469
17 261.4574890136719
18 247.0738067626953
19 233.38963317871094
20 220.3959503173828
21 208.03794860839844
22 196.26303100585938
23 185.0552978515625
24 174.39193725585938
25 164.2899932861328
26 154.7275390625
27 145.64393615722656
28 137.0303192138672
29 128.88214111328125
30 121.16526794433594
31 113.86836242675781
32 106.982177734375
33 100.46649932861328
34 94.29634094238281
35 88.48469543457031
36 83.02507019042969
37 77.89762878417969
38 73.08756256103516
39 68.55928802490234
40 64.29976654052734
41 60.3026008605957
42 56.54848098754883
43 53.0252685546875
44 49.726051330566406
45 46.632469177246094
46 43.73428726196289
47 41.01929473876953
...

355 0.000122522353194654
356 0.00011855698539875448
357 0.0001147235234384425
358 0.00011100931442342699
359 0.00010741996811702847
360 0.00010395232675364241
361 0.00010058852058136836
362 9.733974729897454e-05
363 9.419885464012623e-05
364 9.11606039153412e-05
365 8.82219392224215e-05
366 8.53805904625915e-05
367 8.263246854767203e-05
368 7.997483044164255e-05
369 7.74021609686315e-05
370 7.491294672945514e-05
371 7.251047645695508e-05
372 7.017891766736284e-05
373 6.792691419832408e-05
374 6.57469718134962e-05
375 6.364024739013985e-05
376 6.160056364024058e-05
377 5.9628237067954615e-05
378 5.771935684606433e-05
379 5.586957558989525e-05
380 5.408489232650027e-05
381 5.2355488151079044e-05
382 5.068365135230124e-05
383 4.906713365926407e-05
384 4.749967047246173e-05
385 4.5986063923919573e-05
386 4.451510903891176e-05
387 4.3099109461763874e-05
388 4.172445915173739e-05
389 4.0395709220319986e-05
390 3.911253952537663e-05
391 3.786665547522716e-05
392 3.6664536310127005e-05
...

In the last step, we still updated the parameters manually using a simple update rule. This can be automated using the optim package.

### optim package
This abstracts the idea of an optimization algorithm and provides common ones, including SGD, Adam, AdaGrad and RMSProp.
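A sketch showing that, for plain SGD without momentum or weight decay, `optimizer.step()` performs the same update as the manual rule `w -= lr * w.grad`. The toy loss and values are made up for illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1
optimizer = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()      # gradient is 2 * w
optimizer.zero_grad()
loss.backward()

# What the manual rule would produce: w - lr * grad.
expected = (w - lr * w.grad).detach().clone()

optimizer.step()           # in-place update of w by the optimizer

print(torch.allclose(w, expected))  # True
```

Other optimizers such as Adam keep extra per-parameter state, so their steps do not reduce to this one-line rule.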

To automate the optimization, define an optimizer object (here Adam, created just below the learning rate in the example). Changes are marked in the example:

In [31]:
import torch

#N is batch size (for data) and D_in is the dimension for input.
#H is number of hidden dimensions, D_out is output dimension

N, D_in, H, D_out = 64, 1000, 100, 10 

#create random input and output data (requires_grad=False is default). No need to keep track (i.e. no gradients) for
# these tensors
x = torch.randn(N, D_in, dtype=dtype, device=device,requires_grad=False)
y = torch.randn(N, D_out,dtype=dtype, device=device,requires_grad=False)

model = torch.nn.Sequential(  #sequence of layers
            torch.nn.Linear(D_in, H), #first layer. Each linear module holds internal tensors for weights and bias.
            torch.nn.ReLU(),
            torch.nn.Linear(H,D_out), #second layer
)

loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate =1e-4
optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)

for t in range(500):
    #forward pass
    y_pred = model(x)
    
    #loss
    loss = loss_fn(y_pred,y)
    print(t, loss.item())
    
    #Note changed from model.zero_grad
    optimizer.zero_grad() #zero out the gradients before running the backward pass
    #now perform the backward pass and gradient computations
    loss.backward()
    
    #update parameters
    optimizer.step()

0 666.390380859375
1 649.1421508789062
2 632.373291015625
3 616.1592407226562
4 600.4215087890625
5 585.2346801757812
6 570.3975219726562
7 555.941650390625
8 541.8667602539062
9 528.2017211914062
10 514.948974609375
11 502.10211181640625
12 489.59429931640625
13 477.4076843261719
14 465.607666015625
15 454.10833740234375
16 442.9556579589844
17 432.12127685546875
18 421.6650695800781
19 411.53509521484375
20 401.6994934082031
21 392.1106262207031
22 382.727783203125
23 373.57318115234375
24 364.6459045410156
25 355.9159851074219
26 347.4058532714844
27 339.09979248046875
28 330.9993896484375
29 323.0949401855469
30 315.36859130859375
31 307.8323669433594
32 300.4770202636719
33 293.3368835449219
34 286.3800964355469
35 279.6339111328125
36 273.0387268066406
37 266.5888977050781
38 260.26708984375
39 254.08700561523438
40 248.0618438720703
41 242.158203125
42 236.38510131835938
43 230.71446228027344
44 225.13047790527344
45 219.6458282470703
46 214.25900268554688
47 208.99014282226562


378 2.7953487006016076e-05
379 2.6318073651054874e-05
380 2.4779486921033822e-05
381 2.3323169443756342e-05
382 2.1956459022476338e-05
383 2.066337037831545e-05
384 1.9445975340204313e-05
385 1.8297960195923224e-05
386 1.7218271750607528e-05
387 1.619778231543023e-05
388 1.523672563052969e-05
389 1.4330731573863886e-05
390 1.3479640074365307e-05
391 1.267469360755058e-05
392 1.1919504686375149e-05
393 1.1206476301595103e-05
394 1.0536680747463834e-05
395 9.903460522764362e-06
396 9.309099368692841e-06
397 8.747264473640826e-06
398 8.2205133367097e-06
399 7.72495423007058e-06
400 7.256750905071385e-06
401 6.8156800807628315e-06
402 6.402221060852753e-06
403 6.012487574480474e-06
404 5.646147656079847e-06
405 5.301898454490583e-06
406 4.976234777132049e-06
407 4.670742782764137e-06
408 4.385226020531263e-06
409 4.115542196814204e-06
410 3.861761342704995e-06
411 3.622835265559843e-06
412 3.398137550902902e-06
413 3.187852144037606e-06
414 2.9904163056926336e-06
415 2.8042065878253197e-06

### Custom nn modules
These are useful when we want something more than just a sequence of existing modules. Just as with custom autograd functions, we define a subclass of nn.Module with a forward function that receives a tensor input and produces tensor outputs using other modules or other autograd operations on tensors.
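As one illustration of "more than a sequence", here is a sketch of a module whose forward pass adds a hidden activation back onto a transformed copy of itself. `SkipNet` and its layer names are hypothetical and chosen only for this example.

```python
import torch

class SkipNet(torch.nn.Module):
    """Hypothetical module with a skip connection around the middle layer,
    a data flow that a plain chain of layers does not express directly."""

    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.inp = torch.nn.Linear(D_in, H)
        self.mid = torch.nn.Linear(H, H)
        self.out = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h = self.inp(x).clamp(min=0)
        h = h + self.mid(h).clamp(min=0)   # skip connection
        return self.out(h)

model = SkipNet(D_in=8, H=4, D_out=2)
y_pred = model(torch.randn(16, 8))
print(y_pred.shape)  # torch.Size([16, 2])
```

Because the sub-modules are assigned as attributes in `__init__`, `model.parameters()` finds all of their weights and biases automatically.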

In [33]:
import torch

class TwoLayerNet(torch.nn.Module): #inherit from class nn.Module
    
    def __init__(self,D_in,H,D_out): #input vars
        """
        instantiate the module as needed.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in,H)
        self.linear2 = torch.nn.Linear(H,D_out)
        
    def forward(self, x):
        """
        accept tensor x as input to produce output. Can use modules described above in constructor as well as arbitrary
        operators on tensors (relu here)
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

#N is batch size (for data) and D_in is the dimension for input.
#H is number of hidden dimensions, D_out is output dimension

N, D_in, H, D_out = 64, 1000, 100, 10 

#create random input and output data (requires_grad=False is default). No need to keep track (i.e. no gradients) for
# these tensors
x = torch.randn(N, D_in, dtype=dtype, device=device,requires_grad=False)
y = torch.randn(N, D_out,dtype=dtype, device=device,requires_grad=False)

#first change here, in model, use Custom module
model = TwoLayerNet(D_in, H,D_out)

loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(),lr =1e-4) #changed from Adam to SGD
for t in range(500):
    #forward pass
    y_pred = model(x)
    #compute and print the loss each epoch
    loss = loss_fn(y_pred,y)
    print(t, loss.item())
    #zero out, backprop, update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 685.529296875
1 635.337890625
2 592.3157348632812
3 554.4987182617188
4 520.8770141601562
5 490.76776123046875
6 463.47772216796875
7 438.3726806640625
8 415.3731384277344
9 393.93536376953125
10 373.853271484375
11 355.0190124511719
12 337.3085021972656
13 320.48663330078125
14 304.4160461425781
15 289.02386474609375
16 274.3029479980469
17 260.16656494140625
18 246.68275451660156
19 233.74365234375
20 221.38206481933594
21 209.59825134277344
22 198.32432556152344
23 187.4888916015625
24 177.16595458984375
25 167.35552978515625
26 158.02586364746094
27 149.1493377685547
28 140.73497009277344
29 132.7492218017578
30 125.09747314453125
31 117.84297180175781
32 110.98898315429688
33 104.518310546875
34 98.40524291992188
35 92.64903259277344
36 87.22299194335938
37 82.10503387451172
38 77.28538513183594
39 72.72904968261719
40 68.43241882324219
41 64.39093780517578
42 60.586151123046875
43 57.01301956176758
44 53.64780044555664
45 50.489219665527344
46 47.516639709472656
47 44.725273132

379 0.00015568248636554927
380 0.00015118780720513314
381 0.00014682098117191344
382 0.00014258445298764855
383 0.00013846550427842885
384 0.00013447485980577767
385 0.00013060304627288133
386 0.00012684345711022615
387 0.00012318023073021322
388 0.00011964039003942162
389 0.00011619793076533824
390 0.00011285635264357552
391 0.00010960918007185683
392 0.00010646028385963291
393 0.00010340067819925025
394 0.00010043266229331493
395 9.754474012879655e-05
396 9.47538428590633e-05
397 9.20325837796554e-05
398 8.93963806447573e-05
399 8.683304622536525e-05
400 8.435217023361474e-05
401 8.193223038688302e-05
402 7.958578498801216e-05
403 7.730680226814002e-05
404 7.509643182856962e-05
405 7.295209798030555e-05
406 7.086463301675394e-05
407 6.883971218485385e-05
408 6.687553104711697e-05
409 6.496667629107833e-05
410 6.311532342806458e-05
411 6.131336704129353e-05
412 5.956390668870881e-05
413 5.7869056035997346e-05
414 5.622004027827643e-05
415 5.461672117235139e-05
...

### Dynamic Computation Graphs
Example of a dynamic model in which the number of hidden layers varies randomly between 1 and 4 on each forward pass. The same weights (those of the middle layer being repeated) are reused multiple times within a forward pass to compute the innermost hidden layers (**weight sharing**).

Use normal python flow control (a loop). Implement weight sharing by reusing the same module multiple times when defining the forward pass.

**Note:** this model is hard to train with plain vanilla SGD (why?). The example uses the **momentum** parameter when defining the optimizer.
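A minimal sketch of what weight sharing does to gradients, using a hypothetical one-weight layer so the result can be checked by hand. Reusing a module adds no new parameters, and the gradient of the shared weight sums the contributions from every use.

```python
import torch

shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(3.0)       # set w = 3 for a predictable result

x = torch.tensor([[2.0]])

# Apply the same module twice: y = w * (w * x) = w**2 * x.
y = shared(shared(x))
y.sum().backward()

# Only one parameter tensor exists, however many times the module is used.
print(len(list(shared.parameters())))   # 1
# dy/dw = 2 * w * x = 2 * 3 * 2 = 12, summed over both uses.
print(shared.weight.grad)               # tensor([[12.]])
```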

In [36]:
import random
import torch

class DynamicNet(torch.nn.Module):
    
    def __init__(self, D_in, H,D_out):
        """
        In this constructor 3 linear instances are used for the forward pass
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in,H)
        self.middle_linear = torch.nn.Linear(H,H)
        self.output_linear = torch.nn.Linear(H,D_out)
    
    def forward(self, x):
        """
        randomly choose 0,1,2,3 and reuse the middle layer that many times to compute 
        various hidden layer representations.
        
        since each forward pass builds a dynamic computation graph, we can use flow control right here in the 
        implementation of the forward method.
        
        A for loop is used here, but there are no restrictions on conditional statements
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0,3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        
        y_pred = self.output_linear(h_relu)
        return y_pred

#the rest of the script is similar to example above (just change the subclass name)
#N is batch size (for data) and D_in is the dimension for input.
#H is number of hidden dimensions, D_out is output dimension

N, D_in, H, D_out = 64, 1000, 100, 10 

#create random input and output data (requires_grad=False is default). No need to keep track (i.e. no gradients) for
# these tensors
x = torch.randn(N, D_in, dtype=dtype, device=device,requires_grad=False)
y = torch.randn(N, D_out,dtype=dtype, device=device,requires_grad=False)

#first change here, in model, use Custom module
model = DynamicNet(D_in, H,D_out)

loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(),lr =1e-4, momentum = 0.9) #Momentum used.
for t in range(500):
    #forward pass
    y_pred = model(x)
    #compute and print the loss each epoch
    loss = loss_fn(y_pred,y)
    print(t, loss.item()) ##extract scalar for printing using .item()
    #zero out, backprop, update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


0 695.0001831054688
1 658.2896728515625
2 661.1303100585938
3 654.8280029296875
4 645.146728515625
5 639.4740600585938
6 484.40386962890625
7 444.313232421875
8 638.9232177734375
9 619.7312622070312
10 634.03466796875
11 610.7426147460938
12 282.1170654296875
13 600.6173706054688
14 229.688232421875
15 196.1826934814453
16 598.949462890625
17 128.1256866455078
18 101.63780212402344
19 81.0823745727539
20 66.53426361083984
21 547.3988647460938
22 52.46210479736328
23 605.676513671875
24 599.3804931640625
25 502.1092529296875
26 481.9416809082031
27 463.1158142089844
28 541.719482421875
29 391.3884582519531
30 342.04534912109375
31 450.7575378417969
32 244.20376586914062
33 303.4327697753906
34 271.57611083984375
35 255.37318420410156
36 141.01829528808594
37 120.36034393310547
38 82.71419525146484
39 89.44031524658203
40 178.10093688964844
41 69.85724639892578
42 62.8962287902832
43 130.2949981689453
44 74.0921630859375
45 67.6636962890625
46 55.073184967041016
47 186.94583129882812
...

377 1.1520771980285645
378 2.59019136428833
379 1.212820291519165
380 0.786175012588501
381 1.004677414894104
382 0.7920777797698975
383 1.598036289215088
384 0.20308414101600647
385 0.6949675679206848
386 0.6581180691719055
387 0.6273066997528076
388 0.4464653730392456
389 0.4420221447944641
390 0.3231472074985504
391 0.6047117710113525
392 0.5109692215919495
393 1.428723931312561
394 0.3052345812320709
395 0.4920048117637634
396 0.477362722158432
397 0.33029085397720337
398 0.3351137042045593
399 0.31951916217803955
400 0.4097343683242798
401 0.3208383321762085
402 1.2451120615005493
403 0.2872730493545532
404 0.24102036654949188
405 0.18371833860874176
406 0.615006685256958
407 0.420208603143692
408 0.9909774661064148
409 0.11718960106372833
410 0.12668126821517944
411 0.12016744166612625
412 1.1136740446090698
413 0.48540782928466797
414 0.07076679915189743
415 0.48021015524864197
416 0.47492921352386475
417 0.09421706944704056
418 0.8413783311843872
419 0.07469574362039566
...