## Gradient Descent

Let us implement training a neural network using gradient descent in PyTorch.

PyTorch uses its autograd mechanism ([docs](https://pytorch.org/docs/stable/notes/autograd.html)) to automatically calculate the gradient of an error with respect to every parameter in a computation graph.
For this we will:
- Calculate the output of a neuron given its current weights
- Calculate the error with a given label
- Let PyTorch figure out the gradients using .backward()
- Apply the gradients with $w \leftarrow w - \alpha \cdot w.grad$
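
The four steps above can be sketched on a tiny example before we touch the real data. The toy inputs, labels, the squared error and the learning rate `alpha = 0.1` are assumptions made up for illustration; only `requires_grad`, `.backward()` and `torch.no_grad()` are PyTorch's own API.

```python
import torch

# Toy data (made up): three samples with one feature and a binary label
x = torch.tensor([[0.5], [1.5], [-1.0]])
y = torch.tensor([[1.0], [1.0], [0.0]])

# One weight and one bias that autograd should track
w = torch.zeros(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
alpha = 0.1  # learning rate (assumed value)

# 1. Calculate the output of the neuron with its current weights
y_hat = torch.sigmoid(x @ w + b)
# 2. Calculate the error with the given labels (mean squared error here)
error = ((y - y_hat) ** 2).mean()
# 3. Let PyTorch figure out the gradients
error.backward()
# 4. Apply the gradients; no_grad keeps the update out of the graph
with torch.no_grad():
    w -= alpha * w.grad
    b -= alpha * b.grad
# Gradients accumulate across backward() calls, so reset them
w.grad.zero_()
b.grad.zero_()

print(w, b)
```

The update is wrapped in `torch.no_grad()` because the weights are leaf tensors that require gradients, and PyTorch does not allow such tensors to be modified in place while tracking is active.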

In [None]:
# Let's start by importing the relevant packages
# matplotlib for plots
import matplotlib as mpl
from matplotlib import pyplot as plt
# pandas to read in some data
import pandas as pd
# numpy to build our first perceptron
import numpy as np
# train_test_split to validate our findings from the perceptron training
from sklearn.model_selection import train_test_split
# MinMaxScaler to normalise the data before inputting them to the perceptron
from sklearn.preprocessing import MinMaxScaler
# PyTorch for neural networks
import torch
import time
from torch import nn
%matplotlib inline
mpl.rcParams['figure.figsize'] = (16, 9)
import os
home = os.path.expanduser("~")
data = home + '/data/workshop_data/occupancy_data/datatraining.txt'
# Let us load the data from the previous example
df = pd.read_csv(data)

target = 'Occupancy'
features = [col for col in df.columns if target not in col and 'date' not in col]

In [None]:
x_train, x_val, y_train, y_val = train_test_split(df[features], df[target])
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)

## Build the artificial neuron
To build and train a neuron we have to perform four steps:
- Calculate the neuron's output: $\hat{y} = \sigma\left(\sum_i w_i x_i\right)$
- Determine the error: $E(w) = \frac12 \sum_{(x,y) \in D} (y-\hat{y})^2$
- Calculate the weight gradient: $\frac{\partial E}{\partial w_i} = -\sum_{(x,y) \in D} (y-\hat{y})\,\hat{y}(1-\hat{y})\,x_i$
- Repeat the above steps until no more updates occur (we will train for a fixed number of batches instead)
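
Applying the chain rule to the squared error of a sigmoid neuron gives $\frac{\partial E}{\partial w_i} = -\sum (y-\hat{y})\,\hat{y}(1-\hat{y})\,x_i$. As a sanity check, the following sketch compares this hand-derived expression with the gradient that autograd computes. The random data and the omission of a bias term are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.rand(8, 5)                      # 8 made-up samples with 5 features
y = torch.randint(0, 2, (8, 1)).float()   # made-up binary labels
w = torch.randn(5, 1, requires_grad=True)

# Forward pass and squared error E(w) = 1/2 * sum (y - y_hat)^2
y_hat = torch.sigmoid(x @ w)
E = 0.5 * ((y - y_hat) ** 2).sum()
E.backward()

# Hand-derived gradient: -sum (y - y_hat) * y_hat * (1 - y_hat) * x_i
with torch.no_grad():
    analytic = -(x.T @ ((y - y_hat) * y_hat * (1 - y_hat)))

print(torch.allclose(w.grad, analytic, atol=1e-5))
```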

PyTorch abstracts neural networks using the nn.Module class. Every neural network has to subclass it for PyTorch's mechanisms to work properly. In addition, every layer has to be an attribute of the network's class; otherwise its weights do not appear as parameters of the network. Let us start by building a single neuron within a PyTorch module.

In [None]:
class Neuron(nn.Module):
    
    def __init__(self, number_of_inputs):
        super().__init__()
        # Build the neuron using nn.Linear
        self.neuron = nn.Linear(number_of_inputs, 1, bias=True)
        # use nn.Sigmoid as an activation function
        self.act = nn.Sigmoid()
    
    def logit(self, inp):
        return self.neuron(inp)
    
    def forward(self, inp):
        return self.act(self.logit(inp))
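
The requirement that every layer be an attribute of the module can be checked directly. In this sketch the class names `Registered` and `Hidden` are hypothetical; the second class stores its layer in a plain Python list, which nn.Module does not inspect.

```python
import torch
from torch import nn

class Registered(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigned as an attribute: nn.Module registers the layer
        self.layer = nn.Linear(5, 1)

class Hidden(nn.Module):
    def __init__(self):
        super().__init__()
        # Stored in a plain list: the layer stays invisible to nn.Module
        self.layers = [nn.Linear(5, 1)]

n_registered = len(list(Registered().parameters()))  # weight and bias
n_hidden = len(list(Hidden().parameters()))          # nothing found
print(n_registered, n_hidden)
```

An optimizer built from `Hidden().parameters()` would have nothing to update, which is why the weights must be visible to the module. For a variable number of layers, `nn.ModuleList` is the registered counterpart of a plain list.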
    

Let us now draw a random sample of the training data and calculate the gradients for the neuron:

In [None]:
loss = nn.BCEWithLogitsLoss()
neuron = Neuron(5)
select = np.random.randint(0, len(x_train), 2014)
x = torch.from_numpy(x_train[select]).float()
y = torch.from_numpy(y_train.iloc[select].values).float().unsqueeze(1)
y_logits = neuron.logit(x)
err = loss(y_logits, y)
err.backward()
for name, param in neuron.named_parameters():
    print(f'Parameter {name}\n{param}\nGradient {param.grad}')
    # Update in place; `param = param - ...` would only rebind the loop variable
    with torch.no_grad():
        param -= 5e-2 * param.grad


In [None]:
optim = torch.optim.SGD(neuron.parameters(), lr=5e-2)
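
Plain SGD without momentum or weight decay performs the same update we wrote by hand, $w \leftarrow w - \alpha \cdot w.grad$. The following sketch checks this on a stand-alone `nn.Linear` layer; the random data and the layer itself are assumptions for illustration and are independent of the neuron above.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(5, 1)
optim = torch.optim.SGD(layer.parameters(), lr=5e-2)

x = torch.rand(16, 5)                      # made-up inputs
y = torch.randint(0, 2, (16, 1)).float()   # made-up binary labels

# Clear old gradients, then compute fresh ones
optim.zero_grad()
err = nn.BCEWithLogitsLoss()(layer(x), y)
err.backward()

# What the manual rule would produce, computed before the optimizer runs
expected = layer.weight.detach() - 5e-2 * layer.weight.grad
optim.step()

print(torch.allclose(layer.weight.detach(), expected))
```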

In [None]:
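def fit_batch(optim, loss, neuron, x, y):
    # Reset gradients left over from the previous step
    optim.zero_grad()
    # BCEWithLogitsLoss expects raw logits, not sigmoid outputs
    y_pred = neuron.logit(x)
    err = loss(y_pred, y)
    # The loss is already averaged over the batch
    err.backward()
    optim.step()
    return y_pred.detach()

def eval_batch(neuron, x):
    # No gradients are needed for evaluation
    with torch.no_grad():
        y_pred = neuron.logit(x)
    return y_pred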

start = time.time()
for epoch in range(20):
    acc = 0.0
    for _ in range(200):
        select = np.random.randint(0, len(x_train), 2048)
        x = torch.from_numpy(x_train[select]).float()
        y = torch.from_numpy(y_train.iloc[select].values).float().unsqueeze(1)
        y_pred = fit_batch(optim, loss, neuron, x, y)
        # y_pred holds logits, so a probability of 0.5 corresponds to a logit of 0
        acc += (y == (y_pred > 0).float()).float().mean()
    print(f'accuracy {acc.item()/200}')
print(f'Training time: {time.time() - start}')


Why did we use the logit method instead of calling forward, which includes the sigmoid function?
Chaining a sigmoid and a cross entropy loss as two separate steps can be numerically unstable.
The combination can be simplified analytically, and BCEWithLogitsLoss computes it directly in this stable form.
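
A minimal sketch of the instability, assuming a single made-up logit of -200 with label 1: in float32 the sigmoid underflows to exactly 0, so taking the log by hand yields an infinite loss, while the fused loss stays finite.

```python
import torch
from torch import nn

logit = torch.tensor([[-200.0]])  # a very confident, wrong prediction
label = torch.tensor([[1.0]])

# Naive: apply the sigmoid first, then take the log by hand.
# sigmoid(-200) underflows to 0 in float32, so the result is inf.
naive = -torch.log(torch.sigmoid(logit))

# Fused: sigmoid and cross entropy combined in one stable computation
fused = nn.BCEWithLogitsLoss()(logit, label)

print(naive.item(), fused.item())
```

An infinite loss produces unusable gradients, whereas the fused version returns a loss of about 200, which still gives the optimizer a meaningful direction.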

## Move the neuron to the GPU
PyTorch tensors and modules allow us to call .cuda() on them to move the computations to the GPU.
This makes it really easy to perform any calculation on the GPU (which is super handy even if you do not use neural networks).
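
Besides .cuda(), tensors and modules also offer `.to(device)`, which lets the same code run with or without a GPU. A minimal sketch, with a made-up layer and input:

```python
import torch
from torch import nn

# Pick the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(5, 1).to(device)     # moves the weights to the device
x = torch.rand(4, 5, device=device)    # creates the input on the device
out = layer(x)                         # the result lives on the same device

print(out.device)
```

Inputs and weights have to be on the same device; mixing a CPU tensor with a GPU module raises a runtime error.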


In [None]:
if torch.cuda.is_available():
    neuron = Neuron(5).cuda()
    optim = torch.optim.SGD(neuron.parameters(), lr=5e-2)
    start = time.time()
    for epoch in range(20):
        acc = 0.0
        for _ in range(200):
            select = np.random.randint(0, len(x_train), 2048)
            x = torch.from_numpy(x_train[select]).float().cuda()
            y = torch.from_numpy(y_train.iloc[select].values).float().unsqueeze(1).cuda()
            y_pred = fit_batch(optim, loss, neuron, x, y)
            # y_pred holds logits, so a probability of 0.5 corresponds to a logit of 0
            acc += (y == (y_pred > 0).float()).float().mean()
        # .item() copies the accumulated accuracy back to the CPU
        print(f'accuracy {acc.item()/200}')
    print(f'Training time: {time.time() - start}')
    

## Why is the GPU version slower?

Well, we need to move the data to the GPU and back, which costs time. This normally pays off, because the computations take far longer than moving the data. In our case, however, the computation is very simple and the amount of data very small. The GPU is not well suited for this, because it cannot use its advantage of performing many computations in parallel.
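
To see the effect of problem size, one can time the same operation for a small and a large workload. The helper `time_matmul` and the matrix sizes below are hypothetical choices for illustration; actual timings depend entirely on the hardware. GPU kernels run asynchronously, so the sketch calls `torch.cuda.synchronize()` before reading the clock.

```python
import time
import torch

def time_matmul(n, device, repeats=5):
    """Average time of an n x n matrix product on the given device."""
    a = torch.rand(n, n, device=device)
    b = torch.rand(n, n, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish pending work before timing
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the GPU to finish
    return (time.perf_counter() - start) / repeats

devices = [torch.device("cpu")]
if torch.cuda.is_available():
    devices.append(torch.device("cuda"))

timings = {}
for n in (32, 1024):
    for device in devices:
        timings[(n, device.type)] = time_matmul(n, device)
        print(f'{device.type:>4} n={n:>4}: {timings[(n, device.type)]:.6f}s')
```

On typical hardware the GPU only gains an advantage once the matrices are large enough to keep its many cores busy.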