<a href="https://colab.research.google.com/github/jsqihui/ai/blob/master/2020-12-16-SGD-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SGD Experiment 2
> Applying SGD and the basics of PyTorch (based on fast.ai lesson 4)

- toc:true
- branch: master
- badges: true
- comments: true
- author: jsqihui
- categories: [fast.ai]



In [1]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

from fastai.vision.all import *
from fastbook import *

matplotlib.rc('image', cmap='Greys')

Mounted at /content/gdrive


In [2]:
#hide
path = untar_data(URLs.MNIST_SAMPLE)
Path.BASE_PATH = path
threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
seven_tensors = [tensor(Image.open(o)) for o in sevens]
three_tensors = [tensor(Image.open(o)) for o in threes]
stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255
valid_3_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'3').ls()])
valid_3_tens = valid_3_tens.float()/255
valid_7_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'7').ls()])
valid_7_tens = valid_7_tens.float()/255

In [3]:
#hide
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
dset = list(zip(train_x,train_y))
valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x,valid_y))

In [4]:
x, y = dset[0]
x.shape, y.shape

(torch.Size([784]), torch.Size([1]))

Each 28*28-pixel image is flattened into a vector, that is, a tensor of shape [784]. Each label becomes a tensor of shape [1].

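The architecture is (Weight*ImageTensor).sum()+bias.
> Tip: The multiplication of the weights and the ImageTensor here is element-wise (not bitwise): every pixel has its own weight, and the product still holds 784 values (shape [1x784]) before they are summed.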
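
As a minimal sketch of this architecture, the block below applies it to one stand-in image. The random tensors `x`, `w` and `b` are assumptions made for illustration; they are not the notebook's MNIST data or its `weights` and `bias`.

```python
import torch

torch.manual_seed(0)
x = torch.rand(784)        # stand-in for one flattened 28*28 image (hypothetical data)
w = torch.randn(784, 1)    # one weight per pixel, same shape as the notebook's weights
b = torch.randn(1)         # a single bias value

elementwise = x * w.T              # each pixel multiplied by its own weight
pred = elementwise.sum() + b       # sum the 784 products, then add the bias

print(elementwise.shape)           # the product keeps one value per pixel
print(pred.shape)                  # a single prediction for the image
```

Summing the per-pixel products collapses the 784 values into one number, so each image ends up with exactly one prediction.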

In [19]:
def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()
weights = init_params((28*28,1))
bias = init_params(1)

In [20]:
(train_x[0]*weights.T).sum() + bias

tensor([15.7077], grad_fn=<AddBackward0>)

PyTorch can evaluate this for every image at once, and on a GPU when one is available. `@` denotes matrix multiplication.
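
The sketch below checks, under stated assumptions, that one matrix multiplication over a batch gives the same predictions as handling the images one at a time. The batch `xb` and the parameters `w` and `b` are random stand-ins rather than the notebook's tensors, and the device selection is only an illustration of how the same code can run on a GPU.

```python
import torch

# Use a GPU when one is available; otherwise the same code runs on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(0)
xb = torch.rand(5, 784, device=device)    # hypothetical mini-batch of 5 flattened images
w = torch.randn(784, 1, device=device)    # one weight per pixel
b = torch.randn(1, device=device)         # bias

# One matrix multiplication produces a prediction for every image in the batch
batch_preds = xb @ w + b

# The same computation, one image at a time, with element-wise multiply and sum
loop_preds = torch.stack([(x * w.T).sum() + b for x in xb])

print(batch_preds.shape)
print(torch.allclose(batch_preds, loop_preds, atol=1e-3))
```

The two results agree up to floating-point rounding; the matrix form is simply shorter and lets PyTorch process the whole batch in one call.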

In [21]:
def linear1(xb): return xb@weights + bias
preds = linear1(train_x)

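Strictly speaking, `preds>0.0` is the decision rule used to measure accuracy rather than a loss function. The threshold 0 is chosen arbitrarily and carries no special meaning.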

In [22]:
corrects = (preds>0.0).float() == train_y
corrects.float().mean().item()

0.5525975823402405

In [23]:
weights[0] *= 1.0001
preds = linear1(train_x)
((preds>0.0).float() == train_y).float().mean().item()

0.5525975823402405
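
The unchanged accuracy can be reproduced on a tiny example. This is a sketch under assumptions: the five inputs, their labels and the single-weight model are made up for illustration, and the smooth loss uses a sigmoid in the same spirit as the loss defined later in this post.

```python
import torch

x = torch.tensor([-2.0, -1.0, 1.0, 2.0, -0.5])   # toy inputs (hypothetical data)
y = torch.tensor([0.0, 0.0, 1.0, 1.0, 1.0])      # toy labels; the last item is misclassified

def accuracy(w):
    # Thresholded predictions: only the sign of x * w matters
    return ((x * w > 0.0).float() == y).float().mean().item()

def smooth_loss(w):
    # Sigmoid-based distance between each prediction and its label
    p = torch.sigmoid(x * w)
    return torch.where(y == 1, 1 - p, p).mean().item()

w = torch.tensor(0.5)
acc_before, acc_after = accuracy(w), accuracy(w * 1.01)
loss_before, loss_after = smooth_loss(w), smooth_loss(w * 1.01)

print(acc_before, acc_after)      # accuracy does not move
print(loss_before, loss_after)    # the smooth loss changes slightly
```

Accuracy only changes when a prediction crosses the threshold, so a small nudge to a weight usually leaves it exactly where it was, while the sigmoid-based value responds to every nudge.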

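From [SGD Experiment 1](https://jsqihui.github.io/ai/fast.ai/2020/12/16/SGD-1.html), we know that the end goal is to find good parameters, here `weights` and `bias`, to plug into the architecture for prediction. To find them, we compute the gradient of each parameter and update it with parameter = parameter - (learning rate) * gradient. The gradient comes from the derivative of the loss function. Some functions, however, have a derivative of 0 over most of their range. The thresholded accuracy above is one of them: it does not change when a weight moves by a tiny amount, so its gradient is 0 and the parameters would never be updated. We want a loss function whose value drops a little whenever a small weight update makes the predictions a little better.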
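
The update rule can be sketched on a single toy parameter. The loss `(w - 1.0) ** 2`, the starting value and the learning rate are assumptions chosen only to show the mechanics; they are not taken from the notebook.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)   # a single toy parameter
lr = 0.1                                    # learning rate (arbitrary choice)

for step in range(50):
    loss = (w - 1.0) ** 2       # toy smooth loss with its minimum at w = 1
    loss.backward()             # store d(loss)/dw in w.grad
    with torch.no_grad():
        w -= lr * w.grad        # parameter = parameter - (learning rate) * gradient
    w.grad.zero_()              # clear the gradient before the next step

print(w.item())                 # ends close to 1, the minimum of the toy loss
```

Because this loss is smooth, every step gets a useful gradient and `w` moves steadily toward the minimum.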

Having defined a loss function, now is a good moment to recapitulate why we did this. After all, we already had a metric, which was overall accuracy. So why did we define a loss?

The key difference is that the metric is to drive human understanding and the loss is to drive automated learning. To drive automated learning, the loss must be a function that has a meaningful derivative. It can't have big flat sections and large jumps, but instead must be reasonably smooth. This is why we designed a loss function that would respond to small changes in confidence level. This requirement means that sometimes it does not really reflect exactly what we are trying to achieve, but is rather a compromise between our real goal, and a function that can be optimized using its gradient. The loss function is calculated for each item in our dataset, and then at the end of an epoch the loss values are all averaged and the overall mean is reported for the epoch.

Metrics, on the other hand, are the numbers that we really care about. These are the values that are printed at the end of each epoch that tell us how our model is really doing. It is important that we learn to focus on these metrics, rather than the loss, when judging the performance of a model.



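The passage above is quoted directly from the fast.ai book; it helps a lot in understanding the difference between a metric and a loss function. The loss function below meets the requirement: it first maps every prediction into the range (0, 1) with a sigmoid, then takes the mean distance between each prediction and its target.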

In [31]:
weights = init_params((28*28,1))
bias = init_params(1)

In [32]:
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()  # sigmoid is a built-in tensor method in PyTorch
    return torch.where(targets==1, 1-predictions, predictions).mean()
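
To see how this loss responds to confidence, the sketch below evaluates it on three hand-picked sets of raw predictions. The definition is repeated so the block runs on its own, and the targets and prediction values are assumptions made for illustration.

```python
import torch

def mnist_loss(predictions, targets):
    # Repeated from the cell above so this block is self-contained
    predictions = predictions.sigmoid()
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

targets = torch.tensor([1, 0, 1])                      # hypothetical labels
confident_right = torch.tensor([4.0, -4.0, 4.0])       # strongly on the correct side
unsure = torch.tensor([0.0, 0.0, 0.0])                 # sigmoid(0) = 0.5 for every item
confident_wrong = torch.tensor([-4.0, 4.0, -4.0])      # strongly on the wrong side

loss_right = mnist_loss(confident_right, targets).item()
loss_unsure = mnist_loss(unsure, targets).item()
loss_wrong = mnist_loss(confident_wrong, targets).item()

print(loss_right, loss_unsure, loss_wrong)
```

The loss is smallest for confident correct predictions, exactly 0.5 when every prediction sits at the midpoint, and largest for confident wrong ones, which is the graded behaviour the accuracy metric lacks.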

In [33]:
dl = DataLoader(dset, batch_size=256)
valid_dl = DataLoader(valid_dset, batch_size=256)

Combine these steps into one function: compute the predictions, compute the loss from them, then compute the gradients from the loss.

In [34]:
def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()

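Once calc_grad has computed the gradients, update the parameters and then reset the gradients.
> Warning: Why reset the gradients? `loss.backward()` adds each new gradient to whatever is already stored in `.grad`, so without a reset every step would use the sum of all previous gradients.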
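
The need for the reset can be seen on a one-parameter example. This is a small sketch with a made-up function `w * 3`, whose gradient with respect to `w` is always 3.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3).backward()
first = w.grad.item()          # gradient of w * 3 with respect to w

(w * 3).backward()
accumulated = w.grad.item()    # the second gradient was added to the first

w.grad.zero_()                 # reset, as train_epoch does after each update
(w * 3).backward()
after_reset = w.grad.item()    # back to a single gradient

print(first, accumulated, after_reset)   # 3.0 6.0 3.0
```

Without the call to `zero_()`, the stored gradient keeps growing with every `backward()` call even though the true gradient never changes.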

In [35]:
def train_epoch(model, lr, params):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()

In [36]:
def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    return correct.float().mean()

In [37]:
def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb,yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)

In [38]:
lr = 1.
params = weights,bias
train_epoch(linear1, lr, params)
validate_epoch(linear1)

0.707

In [39]:
for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')

0.8539 0.9081 0.9291 0.9365 0.9443 0.9482 0.954 0.9589 0.9609 0.9623 0.9633 0.9662 0.9667 0.9682 0.9682 0.9687 0.9687 0.9692 0.9702 0.9702 

# PyTorch provides many useful functions, so we do not have to write them ourselves

In [40]:
linear_model = nn.Linear(28*28,1)
class BasicOptim:
    def __init__(self,params,lr): self.params,self.lr = list(params),lr

    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None
opt = BasicOptim(linear_model.parameters(), lr)
def train_epoch(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()

In [41]:
def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')

In [42]:
train_model(linear_model, 20)

0.4932 0.8335 0.8438 0.9116 0.9341 0.9482 0.9565 0.9624 0.9658 0.9678 0.9692 0.9717 0.9736 0.9746 0.9761 0.9761 0.9775 0.978 0.9785 0.979 
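
PyTorch also ships its own optimizer, `torch.optim.SGD`, which offers the same `step` and `zero_grad` methods as the hand-written `BasicOptim`. The sketch below swaps it in under assumptions: the data is a small synthetic set rather than MNIST, the loss is a copy of the sigmoid-based loss used in this post, and training uses the full data set at each step instead of mini-batches.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical stand-in data: 200 samples with 4 features,
# labelled 1 when the features sum to a positive number
xs = torch.randn(200, 4)
ys = (xs.sum(dim=1, keepdim=True) > 0).float()

def sigmoid_loss(predictions, targets):
    # Same idea as mnist_loss: distance between sigmoid(prediction) and the target
    predictions = predictions.sigmoid()
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=1.0)   # built-in replacement for BasicOptim

loss_start = sigmoid_loss(model(xs), ys).item()
for epoch in range(50):
    loss = sigmoid_loss(model(xs), ys)
    loss.backward()     # compute gradients
    opt.step()          # p = p - lr * p.grad for every parameter
    opt.zero_grad()     # clear gradients for the next step
loss_end = sigmoid_loss(model(xs), ys).item()

accuracy = ((model(xs) > 0).float() == ys).float().mean().item()
print(loss_start, loss_end, accuracy)
```

The training loop keeps the same three calls as before (`backward`, `step`, `zero_grad`); only the object behind `opt` changes.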

fastai also provides wrappers for these pieces, such as `SGD` and `Learner`.

In [43]:
dls = DataLoaders(dl, valid_dl)
learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)

In [44]:
learn.fit(10, lr=lr)

| epoch | train_loss | valid_loss | batch_accuracy | time |
|---|---|---|---|---|
| 0 | 0.637081 | 0.503588 | 0.495584 | 00:00 |
| 1 | 0.566746 | 0.186157 | 0.844946 | 00:00 |
| 2 | 0.206826 | 0.185936 | 0.832188 | 00:00 |
| 3 | 0.089577 | 0.108501 | 0.910697 | 00:00 |
| 4 | 0.046465 | 0.078758 | 0.932287 | 00:00 |
| 5 | 0.029719 | 0.062852 | 0.946025 | 00:00 |
| 6 | 0.022896 | 0.053008 | 0.954367 | 00:00 |
| 7 | 0.019905 | 0.046501 | 0.962218 | 00:00 |
| 8 | 0.018413 | 0.041952 | 0.965162 | 00:00 |
| 9 | 0.017528 | 0.038611 | 0.967125 | 00:00 |
