### Extra on Training

Remember that with training, we first have our neural network make a prediction, like we did above.

In training we'll do the following.  We'll take our set of images, and for each image, we'll input it into our neural network, have the neural network make a prediction, and then update the neural network based on how far off the prediction was.  So if the neural network said there was a $5$ percent chance (a probability of $.05$) of the picture being a hat, but the picture was in fact a hat -- the neural network will be updated a lot.  And if it predicted a $95$ percent chance (a probability of $.95$) of our hat being a hat, then we would not update our neural network so much.
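
To make "how far off" concrete, here is a minimal sketch.  It assumes the error is measured as the negative log of the probability given to the correct class, which is what the cross entropy loss used later in this lesson works out to.  The helper name `loss_for` is made up for illustration.

```python
import math

def loss_for(prob_of_correct_class):
    # Negative log probability: small when the network is confident and right,
    # large when it gives the correct class a low probability.
    return -math.log(prob_of_correct_class)

bad_guess = loss_for(0.05)   # the picture is a hat, but the network said 5 percent
good_guess = loss_for(0.95)  # the picture is a hat, and the network said 95 percent

print(round(bad_guess, 4))   # 2.9957
print(round(good_guess, 4))  # 0.0513
```

The first loss is more than fifty times larger than the second, which is why the poor prediction calls for a much bigger update.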

What does it mean to update a neural network?  Well, let's take a look at our neural network again.

So we can see that our neural network predicts that the first image is what's at index 4 -- or a coat.  In future lessons, we'll better understand this prediction function.  For now, let's move on to training the neural network.

In [94]:
net

Net(
  (W1): Linear(in_features=784, out_features=64, bias=True)
  (W2): Linear(in_features=64, out_features=64, bias=True)
  (W3): Linear(in_features=64, out_features=64, bias=True)
  (W4): Linear(in_features=64, out_features=10, bias=True)
)

Each one of those `W`s is a layer that holds a grid of numbers (its weights), along with a list of bias numbers.  Let's see one.

In [95]:
net.W3._parameters

OrderedDict([('weight',
              Parameter containing:
              tensor([[ 0.0259, -0.0138, -0.0259,  ..., -0.0757, -0.0621, -0.0298],
                      [ 0.1063,  0.0534, -0.0855,  ..., -0.0165, -0.0313, -0.0833],
                      [ 0.1078,  0.0445,  0.0239,  ..., -0.0503,  0.0975, -0.0178],
                      ...,
                      [ 0.1124,  0.0989, -0.0902,  ..., -0.0939, -0.0652,  0.1107],
                      [ 0.1120, -0.0405,  0.0659,  ..., -0.0038, -0.0218, -0.0882],
                      [ 0.1226,  0.0628, -0.0597,  ..., -0.0875,  0.0712,  0.1169]],
                     requires_grad=True)),
             ('bias',
              Parameter containing:
              tensor([-0.0495,  0.0637, -0.1004, -0.0250, -0.0909,  0.1096,  0.0405,  0.0485,
                       0.0361, -0.1112,  0.0071,  0.0476,  0.1005,  0.0198, -0.0114,  0.0470,
                      -0.0566, -0.0448,  0.0945, -0.0279, -0.0823,  0.1084,  0.1233,  0.1245,
                      -0.

That's a lot of numbers.  

> And remember, these are just the numbers that make up one of those `W`s.  There are four of them.
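
As a rough count of how many numbers that is in total, the sketch below works only from the `in_features` and `out_features` values printed for `net` above.  The variable names are made up for illustration.  Each `Linear` layer holds one weight for every input-output pair, plus one bias for every output.

```python
# (in_features, out_features) for each layer, copied from the printed `net`
layer_shapes = {
    "W1": (784, 64),
    "W2": (64, 64),
    "W3": (64, 64),
    "W4": (64, 10),
}

counts = {}
for name, (n_in, n_out) in layer_shapes.items():
    weights = n_in * n_out  # one weight for every input-output pair
    biases = n_out          # one bias for every output
    counts[name] = weights + biases

total = sum(counts.values())
print(counts)  # {'W1': 50240, 'W2': 4160, 'W3': 4160, 'W4': 650}
print(total)   # 59210
```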

Depending on how far off the neural network's prediction is, those numbers will be updated with each image the neural network sees.

It takes a little bit of setup to go through this, but still let's see it in action.

> We use something called cross entropy loss, and an optimizer -- both of which we'll learn more about later.

In [2]:
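import torch.nn as nn
import torch.optim as optim

# the loss function measures how far off a prediction is
x_loss = nn.CrossEntropyLoss()
x_loss
# CrossEntropyLoss()

# the optimizer is what updates the numbers in our neural network
adam = optim.Adam(net.parameters(), lr=0.0005)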

And then if we pass the neural network's *prediction*, and the corresponding label, to our loss function, we can see how far off we were.

In [97]:
pred = net(first_obs)
pred

first_label = y_train_tensor[0]
first_label

loss = x_loss(pred, first_label)
loss

tensor(2.2632, grad_fn=<NllLossBackward>)
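
One way to get a feel for that number: if a network spread its guess evenly over all ten output categories, the cross entropy loss would be $\ln(10)$.  The sketch below computes this by hand in plain Python.  It is an illustration of the calculation, not how `nn.CrossEntropyLoss` is implemented internally, and the helper name `cross_entropy` is made up.

```python
import math

def cross_entropy(scores, correct_index):
    # Softmax: turn raw scores into probabilities that sum to 1.
    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]
    # The loss is the negative log of the probability of the correct class.
    return -math.log(probs[correct_index])

even_scores = [0.0] * 10  # a network with no preference among 10 classes
even_loss = cross_entropy(even_scores, correct_index=4)
print(round(even_loss, 4))  # 2.3026
```

Our loss of 2.2632 is close to this, which suggests the untrained network is doing little better than guessing evenly.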

Then we can calculate how to update our neural network with the `backward` function, and finally, make the update with the `step` function.

In [98]:
loss.backward()  # calculate how each number should change to reduce the loss
adam.step()      # make the update to the numbers

With that, our neural network has been updated, just a little bit, based on its poor prediction of the first image.  Let's see this.

> Notice that the first number was updated from $.0259$ to $.0309$.

In [100]:
net.W3._parameters

OrderedDict([('weight',
              Parameter containing:
              tensor([[ 0.0309, -0.0088, -0.0209,  ..., -0.0707, -0.0571, -0.0248],
                      [ 0.1013,  0.0484, -0.0905,  ..., -0.0215, -0.0363, -0.0883],
                      [ 0.1128,  0.0495,  0.0289,  ..., -0.0453,  0.1025, -0.0128],
                      ...,
                      [ 0.1174,  0.1039, -0.0852,  ..., -0.0889, -0.0602,  0.1157],
                      [ 0.1170, -0.0355,  0.0709,  ...,  0.0012, -0.0168, -0.0832],
                      [ 0.1276,  0.0678, -0.0547,  ..., -0.0825,  0.0762,  0.1219]],
                     requires_grad=True)),
             ('bias',
              Parameter containing:
              tensor([-0.0445,  0.0587, -0.0954, -0.0200, -0.0859,  0.1046,  0.0355,  0.0435,
                       0.0411, -0.1062,  0.0021,  0.0426,  0.1055,  0.0148, -0.0164,  0.0420,
                      -0.0616, -0.0398,  0.0995, -0.0329, -0.0773,  0.1134,  0.1283,  0.1295,
                      -0.
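
Putting the pieces together, here is a minimal, self-contained sketch of one training step.  The `Net` class below is an assumption reconstructed from the printed layer sizes (the ReLU activations between layers are a guess, since the class definition is not shown here), and the image and label are random stand-ins rather than real data.  The call to `zero_grad` clears gradients left over from any earlier step, which matters once the step is repeated for many images.

```python
import torch
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):
    # Hypothetical stand-in: layer sizes match the printed `net`,
    # but the activation functions are an assumption.
    def __init__(self):
        super().__init__()
        self.W1 = nn.Linear(784, 64)
        self.W2 = nn.Linear(64, 64)
        self.W3 = nn.Linear(64, 64)
        self.W4 = nn.Linear(64, 10)

    def forward(self, x):
        x = torch.relu(self.W1(x))
        x = torch.relu(self.W2(x))
        x = torch.relu(self.W3(x))
        return self.W4(x)

torch.manual_seed(0)
net = Net()
x_loss = nn.CrossEntropyLoss()
adam = optim.Adam(net.parameters(), lr=0.0005)

fake_image = torch.rand(1, 784)  # random stand-in for one image of 784 values
fake_label = torch.tensor([4])   # stand-in label

before = net.W3.weight.detach().clone()  # copy the weights before the update

adam.zero_grad()                 # clear any gradients from a previous step
loss = x_loss(net(fake_image), fake_label)
loss.backward()                  # compute gradients of the loss
adam.step()                      # adjust the parameters using those gradients

after = net.W3.weight.detach()
changed = not torch.equal(before, after)
print(changed)
```

Comparing `before` and `after` is the same check we did by eye above, just done in code.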