## Disclaimer

This workshop is heavily inspired by Andrej Karpathy's YouTube video series on neural networks, mainly https://www.youtube.com/watch?v=VMj-3S1tku0

I watched it several times and then tried to reproduce it from scratch (from memory), because it was very educational. Now I want to give you a sneak peek into what I learned (or "refreshed from memory").

## Takeaways
   * You should have an intuition for why the code below works and why it has this "interface".
   * You should have an intuition for what a gradient is and how it is used in the context of neural networks.
   * You should be motivated to learn more :)

In [1]:
train_x = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
]

train_y = [
    [0],
    [1],
    [1],
    [0]
]

import torch
import torch.nn as nn
import torch.optim as optim
from tqdm.notebook import tqdm

torch.random.manual_seed(42)  # Set a seed for reproducibility

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(2, 2)
        self.fc2 = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid() # https://en.wikipedia.org/wiki/Sigmoid_function

    def forward(self, x):
        x = self.fc1(x)
        x = self.sigmoid(x)  # alternative: x = torch.relu(x) (much faster convergence)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

model = Model()

# https://en.wikipedia.org/wiki/Mean_squared_error
# Mean Squared Error (MSE) - the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual value
criterion = nn.MSELoss()


# https://en.wikipedia.org/wiki/Stochastic_gradient_descent
# Stochastic - it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data)
# https://en.wikipedia.org/wiki/Gradient_descent (see Gradient Descent in 2D)
# Gradient - vector-valued function whose value at a point gives the direction and the rate of fastest increase
# Gradient Descent - step against the gradient (hence the minus sign) to move in the direction of steepest descent
optimizer = optim.SGD(model.parameters(), lr=0.1)

# predictions with randomly initialized weights (before training)
print(model(torch.tensor(train_x[0], dtype=torch.float32)))
print(model(torch.tensor(train_x[1], dtype=torch.float32)))
print(model(torch.tensor(train_x[2], dtype=torch.float32)))
print(model(torch.tensor(train_x[3], dtype=torch.float32)))

"""
train_y = [
    [0],
    [1],
    [1],
    [0]
]
"""

tensor([0.6653], grad_fn=<SigmoidBackward0>)
tensor([0.6683], grad_fn=<SigmoidBackward0>)
tensor([0.6511], grad_fn=<SigmoidBackward0>)
tensor([0.6557], grad_fn=<SigmoidBackward0>)
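
To build intuition for the gradient descent comments above, here is a minimal sketch on a one-dimensional function. The function `f`, the starting point and the learning rate are illustrative assumptions and not part of the workshop code.

```python
import torch

# Hypothetical toy function with its minimum at w = 3.
def f(w):
    return (w - 3.0) ** 2

w = torch.tensor(0.0, requires_grad=True)  # arbitrary starting point
lr = 0.1                                   # learning rate (step size)

for step in range(100):
    loss = f(w)
    loss.backward()        # computes d(loss)/dw and stores it in w.grad
    with torch.no_grad():
        w -= lr * w.grad   # the minus sign moves w against the gradient
    w.grad = None          # reset so the next backward() starts fresh

print(w.item())  # very close to 3.0
```

Each step moves `w` a little way downhill; the network training below does the same thing, only with more parameters.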


In [2]:
model

Model(
  (fc1): Linear(in_features=2, out_features=2, bias=True)
  (fc2): Linear(in_features=2, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [3]:
model.fc1.weight

Parameter containing:
tensor([[ 0.5406,  0.5869],
        [-0.1657,  0.6496]], requires_grad=True)

In [4]:
model.fc1.bias

Parameter containing:
tensor([-0.1549,  0.1427], requires_grad=True)

In [5]:
_input = torch.tensor(train_x[0], dtype=torch.float32)
model(_input)

tensor([0.6653], grad_fn=<SigmoidBackward0>)

In [6]:
_input = torch.tensor(train_x[0], dtype=torch.float32)
_output = ((_input @ model.fc1.weight.t() + model.fc1.bias).sigmoid() @ model.fc2.weight.t() + model.fc2.bias).sigmoid()
_output

tensor([0.6653], grad_fn=<SigmoidBackward0>)
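
The `.t()` calls in the manual forward pass are needed because `nn.Linear` stores its weight with shape `(out_features, in_features)` and computes `x @ W.T + b`. A minimal check, assuming a fresh, hypothetical layer `layer` that is unrelated to the model above:

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 3)        # hypothetical layer: 2 inputs, 3 outputs
x = torch.tensor([1.0, 2.0])

print(layer.weight.shape)      # torch.Size([3, 2]), i.e. (out_features, in_features)

# The same computation written out by hand.
manual = x @ layer.weight.t() + layer.bias
print(torch.allclose(manual, layer(x)))  # True
```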

In [7]:
# Stochastic Gradient Descent
for epoch in tqdm(range(10_000)):
    # batch training - size 1
    for i in range(len(train_x)):
        inputs = torch.tensor(train_x[i], dtype=torch.float32)
        target = torch.tensor(train_y[i], dtype=torch.float32)

        optimizer.zero_grad() # we will see why this is important later
        outputs = model(inputs)
        loss = criterion(outputs, target)
        loss.backward() # we will focus on this part specifically
        optimizer.step()


In [8]:
# after training
print(model(torch.tensor(train_x[0], dtype=torch.float32)))
print(model(torch.tensor(train_x[1], dtype=torch.float32)))
print(model(torch.tensor(train_x[2], dtype=torch.float32)))
print(model(torch.tensor(train_x[3], dtype=torch.float32)))

tensor([0.0483], grad_fn=<SigmoidBackward0>)
tensor([0.9567], grad_fn=<SigmoidBackward0>)
tensor([0.9425], grad_fn=<SigmoidBackward0>)
tensor([0.0434], grad_fn=<SigmoidBackward0>)
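
The training loop above calls `optimizer.zero_grad()` before every backward pass. The reason is that `backward()` adds to `.grad` instead of overwriting it. A minimal sketch with a hypothetical single parameter `w`:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # hypothetical parameter

(w * 3).backward()
first = w.grad.item()    # 3.0

(w * 3).backward()       # no reset in between
second = w.grad.item()   # 6.0: the new gradient was added to the old one

w.grad = None            # reset the gradient by hand
(w * 3).backward()
third = w.grad.item()    # 3.0 again

print(first, second, third)  # 3.0 6.0 3.0
```

In recent PyTorch versions `optimizer.zero_grad()` performs this reset for every parameter by setting `.grad` to `None`. Without it, each update would use the sum of all gradients seen so far.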


In [9]:
# Full Gradient Descent
model = Model()
optimizer = optim.SGD(model.parameters(), lr=1.)

inputs = torch.tensor(train_x, dtype=torch.float32)
targets = torch.tensor(train_y, dtype=torch.float32)

# before training
print(model(torch.tensor(train_x[0], dtype=torch.float32)))
print(model(torch.tensor(train_x[1], dtype=torch.float32)))
print(model(torch.tensor(train_x[2], dtype=torch.float32)))
print(model(torch.tensor(train_x[3], dtype=torch.float32)))

for epoch in tqdm(range(10_000)):
    # full batch - usually not feasible for large datasets
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward() # we will focus on this part specifically
    optimizer.step()

# after training
print(model(torch.tensor(train_x[0], dtype=torch.float32)))
print(model(torch.tensor(train_x[1], dtype=torch.float32)))
print(model(torch.tensor(train_x[2], dtype=torch.float32)))
print(model(torch.tensor(train_x[3], dtype=torch.float32)))

tensor([0.5916], grad_fn=<SigmoidBackward0>)
tensor([0.6037], grad_fn=<SigmoidBackward0>)
tensor([0.5989], grad_fn=<SigmoidBackward0>)
tensor([0.6101], grad_fn=<SigmoidBackward0>)



tensor([0.0157], grad_fn=<SigmoidBackward0>)
tensor([0.4996], grad_fn=<SigmoidBackward0>)
tensor([0.9828], grad_fn=<SigmoidBackward0>)
tensor([0.5006], grad_fn=<SigmoidBackward0>)
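
The per-sample loop and the full-batch loop are more closely related than they may look: with a mean-reduced loss, the full-batch gradient is the average of the single-sample gradients, and the per-sample loop uses one of them at a time as an estimate. The sketch below checks this, assuming a stand-in network `net` with the same layer sizes as `Model`; the seed and the names `net` and `loss_fn` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed for this sketch
net = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(), nn.Linear(2, 1), nn.Sigmoid())
loss_fn = nn.MSELoss()

xs = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
ys = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)

# Gradient of the full-batch loss with respect to the first layer's weights.
net.zero_grad()
loss_fn(net(xs), ys).backward()
full_grad = net[0].weight.grad.clone()

# Average of the four single-sample gradients.
per_sample = []
for x, y in zip(xs, ys):
    net.zero_grad()
    loss_fn(net(x), y).backward()
    per_sample.append(net[0].weight.grad.clone())
mean_grad = torch.stack(per_sample).mean(dim=0)

print(torch.allclose(full_grad, mean_grad, atol=1e-6))  # True
```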


In [10]:
# Full Gradient Descent without optimizer and loss
model = Model()
inputs = torch.tensor(train_x, dtype=torch.float32)
targets = torch.tensor(train_y, dtype=torch.float32)

# before training
print(model(torch.tensor(train_x[0], dtype=torch.float32)))
print(model(torch.tensor(train_x[1], dtype=torch.float32)))
print(model(torch.tensor(train_x[2], dtype=torch.float32)))
print(model(torch.tensor(train_x[3], dtype=torch.float32)))

for epoch in tqdm(range(10_000)):
    # Gradient descent - not stochastic
    for p in model.parameters():
        p.grad = None  # reset gradients

    outputs = model(inputs)
    loss = (outputs - targets).pow(2).mean()  # Mean Squared Error (MSE) loss
    loss.backward()

    with torch.no_grad():  # the update itself must not be tracked by autograd
        for p in model.parameters():
            p -= 1. * p.grad  # update weights manually (learning rate 1.0)

tensor([0.3454], grad_fn=<SigmoidBackward0>)
tensor([0.3412], grad_fn=<SigmoidBackward0>)
tensor([0.3613], grad_fn=<SigmoidBackward0>)
tensor([0.3574], grad_fn=<SigmoidBackward0>)



In [11]:
# after training
print(model(torch.tensor(train_x[0], dtype=torch.float32)))
print(model(torch.tensor(train_x[1], dtype=torch.float32)))
print(model(torch.tensor(train_x[2], dtype=torch.float32)))
print(model(torch.tensor(train_x[3], dtype=torch.float32)))

tensor([0.0171], grad_fn=<SigmoidBackward0>)
tensor([0.9819], grad_fn=<SigmoidBackward0>)
tensor([0.9819], grad_fn=<SigmoidBackward0>)
tensor([0.0223], grad_fn=<SigmoidBackward0>)
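
The hand-written loss and update are meant to do the same work as `nn.MSELoss` and `optim.SGD`. A sketch that checks this for a single step, assuming two identical stand-in networks `net_a` and `net_b` (hypothetical names) built with the same layer sizes as `Model`:

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed for this sketch
net_a = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(), nn.Linear(2, 1), nn.Sigmoid())
net_b = copy.deepcopy(net_a)  # identical starting weights

xs = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
ys = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)
lr = 1.0

# Variant A: library loss and optimizer.
opt = optim.SGD(net_a.parameters(), lr=lr)
opt.zero_grad()
nn.MSELoss()(net_a(xs), ys).backward()
opt.step()

# Variant B: hand-written loss and update.
(net_b(xs) - ys).pow(2).mean().backward()
with torch.no_grad():
    for p in net_b.parameters():
        p -= lr * p.grad

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(net_a.parameters(), net_b.parameters())
)
print(same)  # True
```

Plain SGD (no momentum, no weight decay) is just `p -= lr * p.grad`, which is why both variants end up with the same weights.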


In [None]:
# a prediction is just a series of mathematical operations (which should be differentiable)
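
Because every operation is differentiable, the chain rule gives the gradient of the loss with respect to each parameter, and that is what `loss.backward()` computes. A minimal sketch for a single sigmoid neuron, with hypothetical values for `x`, `y`, `w` and `b`:

```python
import torch

# Hypothetical input, target and parameters of a single neuron.
x, y = torch.tensor(1.5), torch.tensor(1.0)
w = torch.tensor(0.4, requires_grad=True)
b = torch.tensor(-0.2, requires_grad=True)

out = torch.sigmoid(w * x + b)
loss = (out - y) ** 2
loss.backward()  # autograd applies the chain rule for us

# The same chain rule written out by hand.
with torch.no_grad():
    dloss_dout = 2 * (out - y)   # derivative of the squared error
    dout_dz = out * (1 - out)    # derivative of the sigmoid
    manual_dw = dloss_dout * dout_dz * x
    manual_db = dloss_dout * dout_dz

print(torch.allclose(w.grad, manual_dw), torch.allclose(b.grad, manual_db))  # True True
```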

Where is ChatGPT (from the title of the presentation)?

It is "hidden" in another of Karpathy's repositories - https://github.com/karpathy/nanoGPT/blob/master/model.py#L118
It is just a PyTorch module (same as above), but with more layers and a more complex architecture.

So if you understand the code above, you are a step closer to understanding the ChatGPT architecture (more specifically GPT-2, but that would not be such good clickbait :) ).
