<a href="https://colab.research.google.com/github/hllorens/ai-sandbox/blob/main/HF_course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to train a model with PyTorch the hard way
```python
from torch.contributors import mark_saroufim
```

```python
> mark_saroufim.projects
[torchserve, torchX, Pytorch Enterprise, ...]
```

```python
# For more from mark
> mark_saroufim.twitter_account()
twitter.com/marksaroufim
```


In [None]:
import torch

# What is Machine Learning

Well, what's supervised learning?

Data -> Model -> Prediction

`f(x) -> y`

compare `y and y'` and update `f`

what is `x`? Something that can fit in an excel spreadsheet



## How to do Machine Learning


1. Use an open dataset (imdb, imagenet etc..)
2. Use an open model (copy model architecture from famous paper or Github)
3. Use a pretrained model (use a model hub like HF hub, torch hub etc..)
4. Use a model trainer (hf training loop, ptl loop, ignite, fast.ai, keras etc..)
5. **Train a model from scratch** (This talk!)

Nowadays we have

Training loops: `model.fit(data)`

Auto ML: `fit(data)`

Pretrained models: `finetune(data)`

In [None]:
! pip install transformers -q


In [None]:
# HF model hub code
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Model by the deepset.ai team
tokenizer = AutoTokenizer.from_pretrained("deepset/minilm-uncased-squad2")

model = AutoModelForQuestionAnswering.from_pretrained("deepset/minilm-uncased-squad2")


In [None]:
# Neural networks don't know what a string is
tokenized_sentence = tokenizer.tokenize("Is Mark clear?")
print(tokenized_sentence)

['is', 'mark', 'clear', '?']


In [None]:
# Neural networks don't know what strings are
# For images you can use pixels
# For text you can use tokens
tokenized_sentence = tokenizer("What is my name?", return_tensors='pt', truncation=True)
model(**tokenized_sentence)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


QuestionAnsweringModelOutput([('start_logits',
                               tensor([[ 1.7510, -5.8749, -5.8916, -5.0617, -6.2590, -6.0950,  1.7509]],
                                      grad_fn=<CloneBackward0>)),
                              ('end_logits',
                               tensor([[ 2.3407, -4.9248, -5.6527, -5.0064, -5.1805, -4.7368,  2.3406]],
                                      grad_fn=<CloneBackward0>))])

In [None]:
output = model(**tokenized_sentence)

In [None]:
import numpy as np
# Note: `output.items` is a bound method, not the logits, so argmax over it just returns 0
predictions = np.argmax(output.items, axis=-1)
predictions

0

:(
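The `0` above comes from taking the argmax of `output.items`, which is a method rather than a tensor. A minimal sketch of how the output is usually read, assuming the logits printed earlier are simply copied in by hand (so nothing is downloaded): the answer span runs from the argmax of `start_logits` to the argmax of `end_logits`.

```python
import torch

# Logits copied from the QuestionAnsweringModelOutput printed above
start_logits = torch.tensor([[1.7510, -5.8749, -5.8916, -5.0617, -6.2590, -6.0950, 1.7509]])
end_logits = torch.tensor([[2.3407, -4.9248, -5.6527, -5.0064, -5.1805, -4.7368, 2.3406]])

# The predicted span goes from the most likely start token to the most likely end token
start_index = int(torch.argmax(start_logits, dim=-1))
end_index = int(torch.argmax(end_logits, dim=-1))
print(start_index, end_index)
```

Both indices point at the first position, the special token the tokenizer prepends. SQuAD 2.0 style models commonly use that position to signal "no answer", which is plausible here because the question was passed without any context to search.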

# Curve fitting
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html


In [None]:
# Scikit learn
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a line

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)

In [None]:
reg.score(X, y)


1.0

In [None]:
reg.coef_


array([1., 2.])

In [None]:
reg.intercept_

3.0000000000000018

In [None]:

reg.predict(np.array([[3, 5]]))

array([16.])
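Nothing in `LinearRegression` is magic. As a rough sketch (not necessarily how scikit-learn implements it internally), the same coefficients fall out of a plain least-squares solve with NumPy once a column of ones is appended for the intercept. The name `X_with_bias` is made up for this example.

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]], dtype=float)
y = X @ np.array([1.0, 2.0]) + 3.0

# Append a column of ones so the intercept is learned as one more weight
X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])

# Least squares: find w minimising ||X_with_bias @ w - y||^2
w, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
coef, intercept = w[:2], w[2]

prediction = np.array([3.0, 5.0]) @ coef + intercept
print(coef, intercept, prediction)
```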

# Training loops

Note: The same idea applies to the HuggingFace Trainer, Ignite, fast.ai, Keras, or any other popular trainer you like.

https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html


> Set a model and training data

```python
model = MyLightningModule()

trainer = Trainer()

# model can be pretrained as well in which case .fit() is really .finetune()
trainer.fit(model, train_dataloader, val_dataloader)
```

> Data loader https://github.com/pytorch/data

`train_dataset` could be csv, s3 bucket, binary data, images etc..

```python
def train_dataloader(self):
    return torch.utils.data.DataLoader(self.train_dataset)
```

Can be made iterable so you can do

```python
for batch in train_dataloader():
    model(batch)
```
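A small runnable sketch of the data loader idea, assuming a made-up in-memory dataset (`features` and `labels` are random placeholders, not real data). `TensorDataset` and `DataLoader` are the standard `torch.utils.data` classes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 examples with 100 features each, plus integer labels
features = torch.randn(10, 100)
labels = torch.randint(0, 10, (10,))
train_dataset = TensorDataset(features, labels)

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

# Iterating yields (features, labels) batches; the last batch may be smaller
batch_shapes = []
for batch_features, batch_labels in train_dataloader:
    batch_shapes.append((tuple(batch_features.shape), tuple(batch_labels.shape)))
print(batch_shapes)
```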

> `fit()` code looks like

```python
# put model in train mode
model.train()
torch.set_grad_enabled(True)

losses = []
for batch in train_dataloader:
    # calls hooks like this one
    on_train_batch_start()

    # train step
    loss = training_step(batch)

    # clear gradients
    optimizer.zero_grad()

    # backward
    loss.backward()

    # update parameters
    optimizer.step()

    losses.append(loss)
```

# Training loop in English

1. Find out how wrong your model is (loss)
2. Find out how wrong each layer is (loss.backward)
3. Update your model (optimizer.step)
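The three steps can be sketched on a deliberately tiny problem. Everything here (the `y = 2x` data, the single `Linear` layer, the learning rate) is an assumption chosen to keep the example short, not part of the talk.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy problem: learn y = 2x with a single linear layer
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * x
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for _ in range(200):
    optimizer.zero_grad()                             # clear gradients from the previous step
    loss = torch.nn.functional.mse_loss(model(x), y)  # 1. how wrong is the model?
    loss.backward()                                   # 2. how much is each parameter to blame?
    optimizer.step()                                  # 3. update the model
    losses.append(loss.item())

print(losses[0], losses[-1])
```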

# Why Train a model from scratch?
1. Can debug issues more easily
2. Can do ML research
3. You don't have a choice
4. It's fun!

![occult](https://upload.wikimedia.org/wikipedia/commons/1/1b/A_Magician_by_Edward_Kelly.jpg)

Engraving of occultists John Dee and Edward Kelley "in the act of invoking the spirit of a deceased person"; from Astrology (1806) by Ebenezer Sibly. CC


# What is Pytorch?

PyTorch is a Python package that provides two high-level features:

1. Tensor computation (like NumPy) with strong GPU acceleration
2. Deep neural networks built on a tape-based autograd system

In [None]:
class TinyModel(torch.nn.Module):

    def __init__(self):
        super(TinyModel, self).__init__()

        self.linear1 = torch.nn.Linear(100, 1000)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(1000, 10)
        # dim is left unset here, so PyTorch infers it and warns; dim=-1 is the explicit choice
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        # linear
        x = self.linear1(x)
        # non linear
        x = self.activation(x)
        #linear
        x = self.linear2(x)
        # non linear
        x = self.softmax(x)
        return x

In [None]:
model = TinyModel()
print(model)

TinyModel(
  (linear1): Linear(in_features=100, out_features=1000, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=1000, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


# Why Linear Layers
* Attention includes 3 Linear layers
* RNN is implemented using many linear layers
* CNN can be implemented using Linear layers

Different layer types have different inductive biases (a fancy term for being good at certain kinds of tasks), but in terms of underlying ops, a few come up over and over again.

In [None]:
print(model)

TinyModel(
  (linear1): Linear(in_features=100, out_features=1000, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=1000, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


It's all just matrix multiplication

1. in_features = 100
2. hidden_features = 1000
3. Nonlinearity (ReLU and Softmax activation functions)
4. num_classes = 10
5. Pick the most likely class (softmax)
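To make the "just matrix multiplication" claim concrete, here is a short sketch comparing `torch.nn.Linear` with the same computation written by hand. The variable names are made up for the example.

```python
import torch

linear = torch.nn.Linear(100, 1000)
x = torch.randn(100)

# nn.Linear stores its weight with shape (out_features, in_features)
manual = linear.weight @ x + linear.bias

# Compare with a tolerance, since the two paths may round slightly differently
same = torch.allclose(linear(x), manual, atol=1e-4)
print(linear.weight.shape, same)
```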

In [None]:
# ReLU gets rid of negative values
m = torch.nn.ReLU()
input = torch.randn(4)
print(f"input is {input}")
output = m(input)
print(f"output is {output}")

input is tensor([ 1.0583, -0.7064, -0.9405, -0.1706])
output is tensor([1.0583, 0.0000, 0.0000, 0.0000])


In [None]:
# Softmax to pick a class
m = torch.nn.Softmax(dim=-1)
input = torch.randn(4)
print(f"input is {input}")
output = m(input)
print(f"output is {output}")
sum = output.sum(dim=-1)
print(f"sum is {sum}")

input is tensor([-0.2975, -0.3973, -2.3170, -0.8353])
output is tensor([0.3814, 0.3452, 0.0506, 0.2228])
sum is 0.9999998807907104


Other important layer types
* `torch.nn.Conv2d` (get information from a region in a tensor)
* `torch.nn.LSTM` (if you want your model to remember things)

In [None]:
! pip install rich -q


In [None]:
from rich import inspect

In [None]:
inspect(torch.nn.ReLU)

In [None]:
inspect(torch.nn.Linear)

In [None]:
inspect(torch.nn.Softmax)

In [None]:
for param in model.parameters():
  print(param)

Parameter containing:
tensor([[-0.0815,  0.0615,  0.0646,  ...,  0.0222,  0.0194,  0.0032],
        [ 0.0126,  0.0323, -0.0114,  ...,  0.0016,  0.0956, -0.0300],
        [-0.0907,  0.0732, -0.0074,  ..., -0.0555,  0.0403, -0.0816],
        ...,
        [ 0.0669, -0.0596,  0.0222,  ..., -0.0236,  0.0687,  0.0611],
        [-0.0533, -0.0475, -0.0621,  ...,  0.0366, -0.0086,  0.0843],
        [-0.0226,  0.0020,  0.0798,  ..., -0.0105, -0.0931,  0.0431]],
       requires_grad=True)
Parameter containing:
tensor([-8.0451e-02,  2.6209e-02, -3.9328e-02, -6.7743e-02,  1.6366e-02,
        -9.0035e-02,  1.7493e-02,  6.1228e-02,  9.4750e-02, -3.8858e-02,
         5.7701e-02, -4.0638e-02, -1.1974e-02,  2.3735e-02, -8.3661e-02,
         3.9129e-02,  3.0327e-03, -4.7050e-02,  9.7114e-02, -2.1298e-03,
        -3.1234e-02,  9.5966e-03, -4.8852e-02, -7.6519e-03, -2.2535e-02,
         1.7180e-02,  5.4345e-02, -1.4042e-03,  7.9103e-02, -1.6878e-02,
        -5.7365e-03, -9.6197e-02,  6.7781e-02, -7.3764e-0

In [None]:
for param in model.linear1.parameters():
  print(param)

Parameter containing:
tensor([[-0.0815,  0.0615,  0.0646,  ...,  0.0222,  0.0194,  0.0032],
        [ 0.0126,  0.0323, -0.0114,  ...,  0.0016,  0.0956, -0.0300],
        [-0.0907,  0.0732, -0.0074,  ..., -0.0555,  0.0403, -0.0816],
        ...,
        [ 0.0669, -0.0596,  0.0222,  ..., -0.0236,  0.0687,  0.0611],
        [-0.0533, -0.0475, -0.0621,  ...,  0.0366, -0.0086,  0.0843],
        [-0.0226,  0.0020,  0.0798,  ..., -0.0105, -0.0931,  0.0431]],
       requires_grad=True)
Parameter containing:
tensor([-8.0451e-02,  2.6209e-02, -3.9328e-02, -6.7743e-02,  1.6366e-02,
        -9.0035e-02,  1.7493e-02,  6.1228e-02,  9.4750e-02, -3.8858e-02,
         5.7701e-02, -4.0638e-02, -1.1974e-02,  2.3735e-02, -8.3661e-02,
         3.9129e-02,  3.0327e-03, -4.7050e-02,  9.7114e-02, -2.1298e-03,
        -3.1234e-02,  9.5966e-03, -4.8852e-02, -7.6519e-03, -2.2535e-02,
         1.7180e-02,  5.4345e-02, -1.4042e-03,  7.9103e-02, -1.6878e-02,
        -5.7365e-03, -9.6197e-02,  6.7781e-02, -7.3764e-0
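The dumps above show one weight matrix and one bias vector per linear layer. A quick sketch of counting them, using `nn.Sequential` with the same layer sizes as `TinyModel` so the snippet stands on its own:

```python
import torch

# Same layer sizes as TinyModel, built with nn.Sequential to keep the sketch self-contained
model = torch.nn.Sequential(
    torch.nn.Linear(100, 1000),
    torch.nn.ReLU(),
    torch.nn.Linear(1000, 10),
)

# Each Linear layer contributes a weight matrix and a bias vector
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.numel())

total = sum(p.numel() for p in model.parameters())
print(total)  # 100*1000 + 1000 + 1000*10 + 10 = 111010
```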

# Other kinds of architectures
* Transformers: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
* Convolutional Neural Networks
* Recurrent Networks
* Anything you can dream of

## Activation functions

In [None]:
gelu = torch.nn.GELU()
swish = torch.nn.Hardswish()
softmax = torch.nn.Softmax()
input = torch.randn(4)

In [None]:
output = gelu(input)
print(output)

tensor([-0.1693,  0.4294,  0.8428, -0.1299])


In [None]:
output = swish(input)
print(output)

tensor([-0.2949,  0.3554,  0.6678, -0.1593])


In [None]:
# torch.nn.Softmax(input) would build a new module with dim=input instead of applying softmax,
# so call the instance created above
output = softmax(input)
print(output)


## Torch tensors

In [None]:
# for GPU change this to device="cuda"
# same for other devices
a = torch.randn(5, device="cpu")

In [None]:
# Pay special attention to device, dtype, shape
inspect(a)
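Without `rich`, the same three attributes can be read directly. A minimal sketch, assuming the default dtype has not been changed from `float32`:

```python
import torch

a = torch.randn(5, device="cpu")
print(a.device, a.dtype, a.shape)

# .to() returns a converted copy; the original tensor is unchanged
a_double = a.to(torch.float64)
print(a_double.dtype, a.dtype)
```

Mismatched devices or dtypes between a model and its inputs are a common source of runtime errors, which is why these three are worth checking first.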

# How to make an inference with Pytorch

In [None]:
net = TinyModel()

In [None]:
print(net)

TinyModel(
  (linear1): Linear(in_features=100, out_features=1000, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=1000, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


In [None]:
input = torch.randn(100)

In [None]:
output = net(input)



In [None]:
print(output)

tensor([0.0841, 0.1325, 0.0935, 0.0916, 0.1114, 0.0928, 0.0922, 0.1043, 0.1086,
        0.0889], grad_fn=<SoftmaxBackward0>)


In [None]:
print(torch.max(output))
print(torch.argmax(output))

tensor(0.1325, grad_fn=<MaxBackward1>)
tensor(1)


In [None]:
assert net(input)[0] == net.forward(input)[0]



## How to train a model with Pytorch
1. Model: Layers, activation functions
2. Loss: the goal
3. Optimizer: how to reach the goal

Formula for Categorical Cross Entropy

$Loss = - \sum_{i=1}^{outputs} y_i \cdot \log \hat{y}_i$

where

$y_i$ is the target value of the $i$'th output

$\hat{y}_i$ is the $i$'th value in the model output

$outputs$ is the number of scalar values in the model output
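A worked sketch of the formula, with made-up logits for three classes. It computes the loss by hand from softmax probabilities and a one-hot target, then compares with `torch.nn.CrossEntropyLoss`, which takes raw logits and applies log-softmax itself.

```python
import torch

# Hypothetical raw scores (logits) for 3 classes; the correct class is index 0
logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])

# By hand: softmax gives y_hat, one-hot gives y, then apply the formula
y_hat = torch.softmax(logits, dim=-1)
y = torch.nn.functional.one_hot(target, num_classes=3).float()
manual_loss = -(y * torch.log(y_hat)).sum()

# CrossEntropyLoss takes the raw logits and applies log-softmax internally
builtin_loss = torch.nn.CrossEntropyLoss()(logits, target)
print(manual_loss.item(), builtin_loss.item())
```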

In [None]:
import torch.optim as optim

criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.0001, momentum=0.9)

# Batch inference
Turn for loops into tensors

In [None]:
a = torch.rand(2)
print(a)

tensor([0.0116, 0.1588])


In [None]:
a_batch = torch.rand(3,2)
print(a_batch)

tensor([[0.7440, 0.0824],
        [0.4168, 0.8206],
        [0.0141, 0.6375]])


In [None]:
b = [0] * 3
for i in range(3):
  b[i] = a * 2
print(b)

[tensor([0.0232, 0.3177]), tensor([0.0232, 0.3177]), tensor([0.0232, 0.3177])]


In [None]:
# Broadcasting: the scalar is multiplied into every element of the matrix
a_batch * 2

tensor([[1.4880, 0.1649],
        [0.8337, 1.6413],
        [0.0282, 1.2750]])

In [None]:
# Single operation
# [3x2] * [3x1]: b_batch is broadcast across the columns of a_batch (elementwise, not matmul)
b_batch = torch.randn(3,1)
a_batch * b_batch

tensor([[-0.2391, -0.0265],
        [-0.1302, -0.2564],
        [-0.0139, -0.6278]])

One reason why GPUs are so important for Machine Learning
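A short sketch showing that the broadcast multiply gives the same result as an explicit Python loop over rows. The tensors are random, so only the equality is checked.

```python
import torch

a_batch = torch.rand(3, 2)
b_batch = torch.randn(3, 1)

# Loop version: scale each row of a_batch by the matching entry of b_batch
rows = [a_batch[i] * b_batch[i] for i in range(3)]
looped = torch.stack(rows)

# Broadcast version: the (3, 1) tensor is stretched across both columns
broadcast = a_batch * b_batch
print(torch.allclose(looped, broadcast))
```

The broadcast form hands the whole computation to one kernel, which is the kind of work a GPU can run in parallel.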

In [None]:
BATCH_SIZE = 5
NUM_CLASSES = 10

# Switch to GPU
Edit -> Notebook Settings -> Hardware Accelerator -> GPU

In [None]:
# Stack BATCH_SIZE number of vectors on top of each other
inputs = torch.randn(BATCH_SIZE, 100, requires_grad=True, device="cuda")
targets = torch.empty(BATCH_SIZE, dtype=torch.long, device="cuda").random_(NUM_CLASSES)
output = criterion(inputs, targets)


In [None]:
print(f"target shape is {targets.shape}")
print(f"input shape is {inputs.shape}")
print(f"loss is {output}")

target shape is torch.Size([5])
input shape is torch.Size([5, 100])
loss is 5.584603309631348


In [None]:
! pip install tqdm -q
import tqdm

In [None]:
# Setup model
net = TinyModel()
net.to(device="cuda")

# Loss function
criterion = torch.nn.CrossEntropyLoss()

# Setup optimizer with net parameters and learning rate
optimizer = optim.SGD(net.parameters(), lr=0.0001)

NUM_EPOCHS = 5000
running_loss = 0

# Wrap range() in tqdm.trange() for progress bars and throughput estimates
for epoch in range(NUM_EPOCHS):

    # zero the parameter gradients or don't
    optimizer.zero_grad()

    # Calculate the loss or don't
    # Note: TinyModel already ends in a softmax and CrossEntropyLoss applies
    # log-softmax again, which tends to slow learning
    outputs = net(inputs)
    loss = criterion(outputs, targets)

    # Calculate gradients or don't
    loss.backward()

    # Apply gradients or don't
    optimizer.step()

    # print statistics (running_loss is reset every step, so this is the current loss)
    running_loss += loss.item()

    if epoch % 1000 == 0:
        print(f"epoch {epoch}: Running Loss {running_loss}")
    running_loss = 0



epoch 0: Running Loss 2.320944309234619
epoch 1000: Running Loss 2.290494441986084
epoch 2000: Running Loss 2.2350926399230957
epoch 3000: Running Loss 2.130263566970825
epoch 4000: Running Loss 1.9768211841583252
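One likely reason the loss creeps down so slowly: `TinyModel` ends in `Softmax`, and `CrossEntropyLoss` applies log-softmax again on top. The sketch below shows the consequence under that assumption. Even a perfect one-hot probability vector cannot push the loss below about 1.46 for 10 classes, while the same prediction expressed as raw logits gets close to zero.

```python
import math
import torch

criterion = torch.nn.CrossEntropyLoss()
target = torch.tensor([3])

# Best case for a model that ends in softmax: a perfect one-hot probability vector
perfect_probs = torch.nn.functional.one_hot(target, num_classes=10).float()
loss_on_probs = criterion(perfect_probs, target).item()

# The same confident prediction expressed as raw logits
confident_logits = 20.0 * perfect_probs
loss_on_logits = criterion(confident_logits, target).item()

# Lower bound when the inputs are probabilities in [0, 1]: log(e + 9) - 1
print(loss_on_probs, math.log(math.e + 9) - 1, loss_on_logits)
```

A common way around this is to return the output of the last linear layer from `forward` and apply softmax only when probabilities are needed for display.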


# Gradients
Elaborate on gradients
* `optimizer.zero_grad()`
* `loss.backward()`
* `optimizer.step()` -> given some parameters - return new parameters

In [None]:
# No magic - pick any loss function you want
loss = (inputs[0] - targets[0]).sum()
loss.backward() # backward pass

In [None]:
print(loss)

tensor(-194.9108, device='cuda:0', grad_fn=<SumBackward0>)


In [None]:
def gradient_descent(gradient, start, learn_rate, epochs):
    vector = start
    for _ in range(epochs):
        diff = -learn_rate * gradient(vector)
        vector += diff
    return vector
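A usage sketch for the function above, repeated here so the snippet runs on its own. The objective `f(v) = v**2` and the hyperparameters are assumptions picked for illustration.

```python
def gradient_descent(gradient, start, learn_rate, epochs):
    vector = start
    for _ in range(epochs):
        diff = -learn_rate * gradient(vector)
        vector += diff
    return vector

# f(v) = v**2 has gradient 2*v and its minimum at v = 0
minimum = gradient_descent(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2, epochs=50)
print(minimum)
```

Each step multiplies `v` by `1 - 0.2 * 2 = 0.6`, so the value shrinks geometrically toward zero.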

# Gradients
PyTorch is a linear algebra library with GPU support and autograd

So what's autograd?

In [None]:
# https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([5., 6.], requires_grad=True)

c = 3 * a + b * a
c

tensor([16., 27.], grad_fn=<AddBackward0>)

$\frac{dc}{da} = 3 + b = 3 + [5,6] = [8, 9]$

$\frac{dc}{db} = 0 + a = [2,3]$

In [None]:
# c is not a scalar, so backward() needs a starting gradient with the same shape as c
external_grad = torch.tensor([1., 1.])
c.backward(gradient=external_grad)

# And then optimizer.step and that's it!


In [None]:
print(f"a.grad {a.grad}")
print(f"b.grad {b.grad}")

a.grad tensor([8., 9.])
b.grad tensor([2., 3.])
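A small sketch of two related points: summing `c` to a scalar before calling `backward()` is equivalent to passing a gradient of ones, and gradients accumulate across calls. The names `first_a`, `first_b` and `c2` are made up for the example.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([5., 6.], requires_grad=True)
c = 3 * a + b * a

# Summing to a scalar first is equivalent to passing a gradient of ones
c.sum().backward()
first_a, first_b = a.grad.clone(), b.grad.clone()
print(first_a, first_b)

# Gradients accumulate across backward() calls, which is why training loops call zero_grad()
c2 = 3 * a + b * a
c2.sum().backward()
print(a.grad)
```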


# Save model and load later

In [None]:
PATH = 'mynet.pth'
# Call state_dict() to save the weights; without the parentheses the bound method is saved instead
torch.save(net.state_dict(), PATH)

In [None]:
# Rebuild the architecture, then load the saved weights into it
loaded_model = TinyModel()
loaded_model.load_state_dict(torch.load(PATH))

In [None]:
print(loaded_model)


TinyModel(
  (linear1): Linear(in_features=100, out_features=1000, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=1000, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


Deploy in production -> HF inference, torchserve

Share with friends -> S3, hf dataset, Dropbox etc..

# Neural Network Compiler in Python torch.fx
PyTorch is mostly written in C++, but you can do lots of interesting stuff in Python land

In [None]:
# https://pytorch.org/docs/stable/fx.html
import torch.fx as fx


def transform(m: torch.nn.Module,
              tracer_class : type = fx.Tracer) -> torch.nn.Module:
    graph : fx.Graph = tracer_class().trace(m)

    for node in graph.nodes:
      if node.target == "linear1":
        # Replace the call to the first linear layer with a call to linear2
        node.target = "linear2"

    graph.lint()
    return fx.GraphModule(m, graph)

In [None]:
net = TinyModel()
transform(net)

GraphModule(
  (linear2): Linear(in_features=1000, out_features=10, bias=True)
  (activation): ReLU()
  (softmax): Softmax(dim=None)
)
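To see what a transform like this walks over, here is a minimal sketch that traces a small module and lists its graph nodes. `TwoLayer` is a hypothetical stand-in for `TinyModel`; `fx.symbolic_trace` is the standard entry point in `torch.fx`.

```python
import torch
import torch.fx as fx

class TwoLayer(torch.nn.Module):  # hypothetical stand-in for TinyModel
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)
        self.activation = torch.nn.ReLU()

    def forward(self, x):
        return self.activation(self.linear(x))

traced = fx.symbolic_trace(TwoLayer())

# Each node records an operation kind (op) and what it calls (target)
ops = [node.op for node in traced.graph.nodes]
for node in traced.graph.nodes:
    print(node.op, node.target)

# The traced GraphModule is still a regular nn.Module and can be called as usual
out = traced(torch.randn(3, 4))
```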

# FX applications
1. Export to runtime like TensorRT
2. Quantization
3. Feature extraction torchvision (intermediate feature representation)
4. Operator fusion
5. What else?


# In the not so distant future
`model.fit()` will be all you need.

Knowledge of how to train networks from scratch will be lost

There won't be any data without a data loader

*But maybe this talk helps keep civilization alive*

![apo](https://upload.wikimedia.org/wikipedia/commons/e/e1/Apocalypse_vasnetsov.jpg)

Four Horsemen of the Apocalypse, an 1887 painting by Viktor Vasnetsov. From left to right are Death, Famine, War, and Conquest; the Lamb is at the top.


# How to learn Pytorch
1. Read the docs!
2. Read good code, e.g. Phil Wang, Ross Wightman, HuggingFace, Matthias Fey, fast.ai

# How to become a Pytorch developer
Get the attention of a Pytorch developer
1. Solve issues on github.com/pytorch (start with beginner or bootcamp tasks)
2. Create a cool project with Pytorch

Most importantly
> Follow the gradient of your interests