# Demystifying Deep Learning Part 1 Code Notebook

_Author: Dr Musashi Jacobs-Harukawa, DDSS Princeton_



## Introduction

This code notebook is designed as a supplementary resource to the lecture (presentation.md in this GitHub repository).

A few things to note at the outset:

- I use PyTorch. Alternatives are available (primarily TensorFlow and JAX), but a) PyTorch abstracts at a very good level for understanding what is going on, b) it is a dominant framework in industry and c) it's what I know best.
- The visualization code and other bits that I thought were less directly relevant are tucked away in the accompanying `demystifying_utils.py` file. You are welcome to inspect them if you're interested in these steps.


In [None]:
# Download the accompanying functions with the following command
!wget https://raw.githubusercontent.com/muhark/nn-tutorial/main/part1/demystifying_utils.py

In [None]:
# Some utilities
from tqdm import tqdm
from typing import Union, Optional, Literal

# Tools for creating toy datasets
from sklearn import datasets
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Visualization Tools
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.express as px

# Core libraries for deep learning and numerical computing
import numpy as np
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

## Part 1: Linear Regression, Sort of Manually

- We begin in the familiar territory of trying to fit the best linear approximation of a relationship between two variables.
- This is something we do all the time in quantitative analysis - regression analysis.
- But whereas we usually focus on model specification and interpretation, this time we are going to focus on the "fitting" component.
- This exercise will help you understand a lot of the low-level operations that go on in neural networks!

Let's create some made-up data from scratch:


In [None]:
rng = np.random.default_rng(seed=0) # Reproducible random number generator
N_         = 100
coef_      = 3.4
intercept_ = 1.0
std_       = 1.3

X_ = rng.uniform(low=-5, high=5, size=N_).reshape(-1, 1)
y_ = coef_*X_ + intercept_ + rng.normal(loc=0, scale=std_, size=N_).reshape(-1, 1)

Note that even though I record the true coefficient of the data-generating process (DGP), it won't be exactly the optimal coefficient for the sampled data, because of sampling noise.

We can quickly check this by fitting a linear model (also from the `scikit-learn` library).


In [None]:
# OLS Model
lin_reg = LinearRegression()     # Instantiate linear regression model
lin_reg.fit(X_, y_)              # Fit to data
wOLS = lin_reg.coef_.item()      # .coef_ holds weights (betas) and
bOLS = lin_reg.intercept_.item() # .item() returns the scalar

In this figure I show the data points, the "true" model based off of the data-generating process, and the best model for the data using the OLS method.

(Feel free to check the accompanying file `demystifying_utils.py` if you want to know how the visualizations are made).


In [None]:
from demystifying_utils import visualize_linear_model
fig = visualize_linear_model(X_, y_, wOLS, bOLS)
# fig.write_html(write_to_html='figures/figure1.html')
fig

### How do we fit a line to some data points?

- Take a guess
- Calculate how far off the guess is

PyTorch offers built-in tools to conduct this process.

Let's begin by building our model up from values.

In [None]:
# Let's convert our data to PyTorch tensors (float dtype)
X_, y_ = torch.tensor(X_).float(), torch.tensor(y_).float()

# Fit model y = wx + b: learn w and b
# Creating tensors to store values of w and b
# Initializations as "empty"
w = torch.empty(1, 1, requires_grad=True)   # 1x1 uninitialised weight matrix
b = torch.empty(1, 1, requires_grad=True)   # 1x1 uninitialised bias matrix

For our first "guess", we can just use some random values.

There are more principled approaches to this step, but these are beyond the scope of this tutorial.

Instead we initialize the weights and biases with draws from the random normal distribution using ``nn.init.normal_(<param>)``


In [None]:
# We randomize the values of these coefficients
torch.manual_seed(0)
with torch.no_grad():                       # Will explain this later
    nn.init.normal_(w, mean=0, std=0.5)     # Fill with random values
    nn.init.normal_(b, mean=0, std=0.5)
print(w, b)

In [None]:
fig.add_trace(go.Scatter(x=[X_.min(), X_.max()],
                         y=[(X_.min() * w + b).detach().squeeze(), (X_.max() * w + b).detach().squeeze()],
                         mode='lines',
                         name=f'Random Guess: {w.item():.3g}x+{b.item():.3g}',
                         line=dict(color='green', dash='dash')))
fig.update_layout(title='Initial (Random) Guess')
# fig.write_html('figures/figure2.html')

What does this first guess look like?


In [None]:
fig

How do we improve this guess?

## Stochastic Gradient Descent

An algorithm for iteratively updating the parameters of a model to improve its fit to the data.

Let's begin with the first point in the dataset.

In [None]:
x0 = X_[0]
y0 = y_[0]

print(x0, y0, sep='\n')

Let's generate a prediction for the value of `y` at this point.


In [None]:
# Make prediction
yhat_ = x0 * w + b
print(yhat_)

How wrong were we?

In [None]:
error = (y0 - yhat_)
print(error)

However, we need a loss function that has a minimum at zero for reasons discussed in the lecture.

There are multiple options for this; in this case we square the error.

In [None]:
# Begin by defining the loss function: squared loss 
loss = (y0 - yhat_).pow(2)
loss

How does this connect to increasing the accuracy of the model?

Think about the function measuring loss as a function of the weight and bias: $L(w, b)$

We want to know how we should adjust the changeable parameters ($w$ and $b$) in order to reduce the size of our mistake ($L$).

We can do this with a bit of calculus. The _partial derivative_ $\frac{\partial L}{\partial w}$ describes how $L$ changes as $w$ changes, holding all else constant.

PyTorch contains the tools to calculate this automatically, but let's do it by hand to make sure that we understand it.

Returning to the definition of a partial derivative (https://en.wikipedia.org/wiki/Partial_derivative#Definition):

$$
\lim_{h \to 0} \frac{f(x+h)-f(x)}{h}
$$


In [None]:
# Calculating a partial derivative by hand!
h = 0.0001                           # h as some arbitrarily small value
fx =  (y0 - (x0 *  w    + b)).pow(2) # f(x)   (squared loss)
fxh = (y0 - (x0 * (w+h) + b)).pow(2) # f(x+h)
dfx = (fxh-fx)/h                     # (f(x+h)-f(x))/h
print(dfx)

In [None]:
# PyTorch does it for us as well
if w.grad is not None:
    w.grad -= w.grad            # Set gradient to 0, if any
loss = (y0 - (x0*w + b)).pow(2) # Same loss calculation
loss.backward()                 # Calculate dL w.r.t. all parameters (w, b)
print(w.grad)                   # dL/dw stored on tensor w
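
For squared loss the gradient can also be written down with the chain rule: $\frac{\partial L}{\partial w} = -2x\,(y - (wx + b))$ and $\frac{\partial L}{\partial b} = -2\,(y - (wx + b))$. The sketch below checks this against the finite-difference approximation using plain Python floats. The function names and the sample values are illustrative assumptions, not the notebook's `x0`, `y0`, `w` and `b`.

```python
def squared_loss(x, y, w, b):
    """Squared error of the prediction w*x + b for a single observation."""
    return (y - (w * x + b)) ** 2


def squared_loss_grads(x, y, w, b):
    """Analytic partial derivatives of the squared loss (chain rule)."""
    residual = y - (w * x + b)
    dL_dw = -2.0 * x * residual
    dL_db = -2.0 * residual
    return dL_dw, dL_db


# Illustrative values only (assumed for this sketch)
x, y, w, b = 1.5, 6.0, 0.5, -0.2
h = 1e-6  # small step for the finite-difference approximation

dL_dw, dL_db = squared_loss_grads(x, y, w, b)
fd_dw = (squared_loss(x, y, w + h, b) - squared_loss(x, y, w, b)) / h
fd_db = (squared_loss(x, y, w, b + h) - squared_loss(x, y, w, b)) / h

print(f"analytic dL/dw = {dL_dw:.4f}, finite difference = {fd_dw:.4f}")
print(f"analytic dL/db = {dL_db:.4f}, finite difference = {fd_db:.4f}")
```

The two columns should agree to several decimal places; the small remaining gap comes from `h` being finite rather than a true limit.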

What do we do with $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$?

We want to adjust the parameters in the direction of smaller loss.

(Think about it--if $\frac{\partial L}{\partial w}$ is positive and we increase w, then $L$ will increase!)

This is easiest to see if we visualize $L(w)$ and $\frac{\partial L}{\partial w}$.


In [None]:
# See if you can follow this code here
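wvals = []
for wval in np.linspace(-1, 5, 100):
    # Leaf tensor for this candidate value of w (float32 to match x0, y0)
    wval = torch.tensor([wval], dtype=torch.float32, requires_grad=True)
    # b.detach() keeps this exploration from adding to b.grad
    loss = (y0 - (x0 * wval + b.detach())).pow(2)
    loss.backward()
    wvals.append([wval.item(), loss.item(), wval.grad.item()])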

# Make a figure from this
temp = go.Figure(
    data=[go.Scatter(x=[witem[0] for witem in wvals],
                         y=[witem[1] for witem in wvals],
                         name='L(w)',
                         mode='markers+lines'),
          go.Scatter(x=[witem[0] for witem in wvals],
                         y=[witem[2] for witem in wvals],
                         name='dL/dw',
                         visible='legendonly',
                         mode='markers+lines')],
    layout=dict(title=f'Squared Loss vs Loss Gradient for w when x={x0.item():.2f} and y={y0.item():.2f}',
                  xaxis_title='w Parameter Value',
                  yaxis_title='Loss, dL/dw'))
# temp.write_html('figures/figure3.html')
temp

So how much do we adjust our parameters $w$ and $b$?

Note that if we update our model to completely eliminate loss for each observation, then our model will bounce around between perfectly describing individual data points (and fail to capture some global structure).

### Gradient Descent

Gradient Descent is an optimization algorithm where after each guess, we adjust the parameter $w$ using the following formula:

$$w' = w - \eta \frac{dL}{dw}$$

Where $\eta$ is a parameter called the learning rate.

_Learning Rate_

A scaling factor (a "penalty") on each update that limits how much any single observation can move the parameters.

Choosing the learning rate is non-trivial: it is a hyperparameter of the model, and standard practice now is to vary it during training.

We will not cover this in depth, however.
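
As a rough illustration of why the learning rate matters, the sketch below repeats the update $w' = w - \eta \frac{dL}{dw}$ (and the same for $b$) on a single made-up observation with three different learning rates. The function `gd_on_single_point` and all the numbers are assumptions for this example, not part of the notebook.

```python
def gd_on_single_point(x, y, lr, steps=20, w=0.0, b=0.0):
    """Repeat the gradient descent update on one observation; return the loss before each update."""
    losses = []
    for _ in range(steps):
        residual = y - (w * x + b)
        losses.append(residual ** 2)
        w -= lr * (-2.0 * x * residual)   # w' = w - lr * dL/dw
        b -= lr * (-2.0 * residual)       # b' = b - lr * dL/db
    return losses


x, y = 2.0, 5.0  # made-up observation
for lr in (0.001, 0.05, 0.25):
    losses = gd_on_single_point(x, y, lr)
    print(f"lr={lr}: first loss {losses[0]:.3f}, last loss {losses[-1]:.3g}")
```

With these values the smallest rate barely moves, the middle one converges quickly, and the largest overshoots so that the loss grows at every step.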


In [None]:
lr = 1e-3 # Constant learning rate of 0.001

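# backward() accumulates, so clear gradients left over from the cells above
w.grad = None
b.grad = None

# Let's see if our loss goes down!
loss = (y0 - (x0*w + b)).pow(2) # Old loss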
print("Old Loss:", loss.item())

# Backprop loss
loss.backward()

# Gradient descent formula w' = w - (lr * w.grad)
with torch.no_grad(): # Don't record this calculation
    w -= lr * w.grad
    b -= lr * b.grad

print("New Loss (after update): ", ((y0 - (x0*w + b)).pow(2)).item()) # New loss
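
The `torch.no_grad()` context used above was flagged earlier as something to explain. A minimal sketch, using a hypothetical stand-in tensor `w_demo` rather than the notebook's `w`: PyTorch refuses in-place edits to a leaf tensor that requires gradients while autograd is tracking, and `no_grad()` switches that tracking off for the update.

```python
import torch

w_demo = torch.ones(1, requires_grad=True)   # Hypothetical stand-in for w

try:
    w_demo -= 0.1                            # In-place edit while autograd is tracking
    raised = False
except RuntimeError:
    raised = True
print("In-place update outside no_grad raised an error:", raised)

with torch.no_grad():                        # Tracking is switched off inside this block
    w_demo -= 0.1
print(w_demo, w_demo.requires_grad)
```

The tensor still has `requires_grad=True` afterwards, so later forward passes are tracked as usual.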

Let's see how this guess compares to our previous one


In [None]:
fig.add_trace(go.Scatter(x=[X_.min(), X_.max()],
                         y=[(X_.min() * w + b).detach().squeeze(), (X_.max() * w + b).detach().squeeze()],
                         mode='lines',
                         name=f'First Guess: {w.item():.3g}x+{b.item():.3g}',
                         line=dict(color='red', dash='dash')))
fig.update_layout(title='Guess After 1 Observation/Update, η=0.001')
# fig.write_html('figures/figure4.html')
fig

We can repeat this for the entire dataset!

In [None]:
losses = []                    # Tracking loss
for i in tqdm(range(1, N_)):   # Skipping the observation we already have
    w.grad.zero_(); b.grad.zero_() # Reset gradients (note the call parentheses)
    x0 = X_[i]                 # Draw new samples
    y0 = y_[i]
    pred = (x0*w + b)          # Forward pass
    loss = (y0 - pred).pow(2)  # Loss calculation
    losses.append(loss.item()) # Record loss
    loss.backward()            # Backward pass 
    with torch.no_grad():      # Manual gradient descent
        w -= lr * w.grad
        b -= lr * b.grad

In [None]:
# Final figure
fig.add_trace(go.Scatter(x=[X_.min(), X_.max()],
                         y=[(X_.min() * w + b).detach().squeeze(), (X_.max() * w + b).detach().squeeze()],
                         mode='lines',
                         name=f'Final Guess: {w.item():.3g}x+{b.item():.3g}',
                         line=dict(dash='dash')))
fig.update_layout(title='Guess After 1 Epoch')
# fig.write_html('figures/figure5.html')
fig

### Challenge 1: Batch Gradient Descent

Stochastic gradient descent does one observation at a time.

How would you implement batch gradient descent, which uses the full dataset each time?

Enter your code in the cell below. One possible answer is in the cells that follow.


In [None]:
# Enter your code here





**Answer:**

In [None]:
w.grad.zero_(); b.grad.zero_()   # Reset gradients
pred = (X_*w+b)                  # Forward pass
loss = (y_-pred).pow(2).mean()   # Mean squared error loss
losses.append(loss.item())       # Record loss
loss.backward()                  # Backward pass 
with torch.no_grad():            # Manual gradient descent
    w -= lr * w.grad
    b -= lr * b.grad

In [None]:
# 30 epochs batch gd
epochs = 30
losses = []
for _ in tqdm(range(epochs)):
    w.grad.zero_(); b.grad.zero_()   # Reset gradients
    pred = (X_*w+b)                  # Forward pass
    loss = (y_-pred).pow(2).mean()   # Mean squared error loss
    losses.append(loss.item())       # Record loss
    loss.backward()                  # Backward pass 
    with torch.no_grad():            # Manual gradient descent
        w -= lr * w.grad
        b -= lr * b.grad

## Part 2: Getting Harder

So far we've seen a slower and harder way to do regression.

Let's consider a problem that would be difficult (impossible) with a linear regression approach.

It's easiest to visualize it first.


In [None]:
from demystifying_utils import generate_2d_data

X, y = generate_2d_data('Moons')

In [None]:
# Train-test split: hold out 10% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Visualize dataset, train/test split
# (this plotting code has not yet been moved into demystifying_utils.py)
fig = go.Figure()
fig.add_trace( # Visualize train data
    go.Scatter(x=X_train[:, 0],
               y=X_train[:, 1], 
               mode='markers',
               name='Train Set',
               hovertemplate="%{x:.3g}, %{y:.3g}",
               marker=dict(size=3, color=y_train, colorscale='bluered_r'),
               text=['Class 1' if y==1 else 'Class 0' for y in y_train]))
fig.add_trace( # Visualize test data
    go.Scatter(x=X_test[:, 0],
               y=X_test[:, 1],
               mode='markers',
               name='Test Set',
               hovertemplate="%{x:.3g}, %{y:.3g}",
               marker=dict(size=7, color=y_test, colorscale='bluered_r'),
               text=['Class 1' if y==1 else 'Class 0' for y in y_test]))
fig.update_layout(title='Classification Problem in Two Dimensions',
                  xaxis_title='X1',
                  yaxis_title='X2',
                  showlegend=True)
# fig.write_html('figures/figure6.html')

In [None]:
from demystifying_utils import ModelVisualization

### Using Classes and Modules

In the previous section, we used vectors (tensors) to hold all of our parameters and updated them against data.

In this step we're going to swap our mathematician hat for our engineer hat.

Let's create reusable blueprints to hold our models (both the structure and the parameters) and then populate them with data.

Take time to look at this code line-by-line (and the comments)

In [None]:
class LinearNeuron(nn.Module):         # It inherits methods from nn.Module
    def __init__(self,
                 features_in: int=1,     # Number of features in 
                 features_out: int=1,    # Number of features out
                 bias: bool=True):       # Add bias parameter?
        super(LinearNeuron,              # This basically adds methods from
              self).__init__()           # nn.Module to this class.
        self.weights = nn.Parameter(     # Need to define as parameter so
            torch.empty(                 # optimizer knows to optimize it.
                features_in,
                features_out))
        nn.init.normal_(self.weights)    # Initialize to some sane values
        self.bias = nn.Parameter(        # Some models do not need a bias
            torch.empty(1, features_out) # parameter, this is one way
            ) if bias else 0             # to implement that.
        if bias:
            nn.init.normal_(self.bias)
    
    def forward(self, inputs):           # Every nn.Module needs a forward func
        preds = torch.matmul(            # Define forward pass 
            inputs, self.weights         # X∙W
            ) + self.bias                # + b
        return preds                     # return predictions

We create an instance of this blueprint and call it `slnn`.

_Comprehension Check_: why do we set `features_in=2` below?


_Answer_: because we have two input variables, X1 and X2.


In [None]:
# Construct network
slnn = LinearNeuron(features_in=2, bias=True)

In [None]:
# %% Visualize the untrained (randomly initialized) model and its predictions
mv = ModelVisualization(slnn, X_test, y_test, h=0.1)
mv.fig.update_layout(title="Predictions of Untrained Model")
mv.fig

### Training a Module

This time, instead of using one observation at a time, we'll work with _batches_ of data.


In [None]:
# Again, let's convert our data to tensors
X_train, y_train = (torch.tensor(X_train, dtype=torch.float32),
                    torch.tensor(y_train, dtype=torch.float32
                                ).reshape(-1, 1)) # Needs to be 2D

# Use first 8 observations as first batch
inputs = X_train[:8, :]
labels = y_train[:8, :]

Instead of manually doing the gradient updates, we use an optimizer from `torch.optim`.

In this case I am using a Stochastic Gradient Descent (`SGD`) optimizer, which functions the same way as the manual updates we were doing above.

In most modern applications, we would use a better optimizer (namely `Adam` or its variants).
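
To make the claim about `SGD` concrete, the sketch below applies one manual update and one `torch.optim.SGD` step to two copies of the same parameters and compares the results. The data and the names `w_manual` and `w_optim` are made up for this example.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 2)          # Made-up batch: 8 observations, 2 features
y = torch.randn(8, 1)
lr = 5e-4

# Two copies of the same starting parameters
w_manual = torch.randn(2, 1, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)

# Manual update: w' = w - lr * dL/dw
loss = ((x @ w_manual - y) ** 2).mean()
loss.backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# Same update through the optimizer
optim_demo = torch.optim.SGD([w_optim], lr=lr)
optim_demo.zero_grad()
loss = ((x @ w_optim - y) ** 2).mean()
loss.backward()
optim_demo.step()

same = torch.allclose(w_manual, w_optim)
print("Manual and optimizer updates match:", same)
```

This only holds for plain `SGD` with its default settings; options such as momentum or weight decay change the update rule.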


In [None]:
# Initialize optimizer
optim = torch.optim.SGD(slnn.parameters(), lr=5e-4)

# Reset the gradients
optim.zero_grad()

For the forward and backward pass, we can also use a built-in function for the loss.


In [None]:
# Forward pass: Data -> Predictions
preds = slnn(inputs) # We don't actually need to use the `.forward` function

# Loss calculation
loss_fn = nn.MSELoss()
loss = loss_fn(preds, labels)
loss.backward()

Now we can use the optimizer to update the weights with the `.step()` function.


In [None]:
# Update parameters
print(f"Weights: {slnn.weights}",
      f"Bias: {slnn.bias}",
      sep='\n')
print("="*12)
optim.step()
print("PARAMETER UPDATE")
print("="*12)
print(f"Weights: {slnn.weights}",
      f"Bias: {slnn.bias}",
      sep='\n')

### An Aside on Dataloaders

Note that above I manually selected the first 8 rows of the dataset to train the model.

This is an inefficient/inflexible approach.

PyTorch provides powerful tools for feeding data to your model.

It's a bit of a distraction to go into them for now, but we can come back at the end if there's time.


In [None]:
class SimpleDataset(Dataset):
    def __init__(self, X: torch.Tensor, y: torch.Tensor):
        self.features = X
        self.labels = y
    
    def __len__(self): # Required method 1; must return length of data as int
        return len(self.features)
    
    def __getitem__(self, idx: int): # Required method 2; how to grab data
        return self.features[idx, :], self.labels[idx, :]


dataset = SimpleDataset(X_train, y_train) # Instantiate dataset object
dataloader = DataLoader(dataset, batch_size=8, shuffle=True) # Wrap with DataLoader
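
To see what a `DataLoader` actually hands back, here is a small self-contained sketch. It uses PyTorch's built-in `TensorDataset` in place of the custom class above, with made-up data, and turns shuffling off so the batches are predictable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up data: 20 observations, 2 features, 1 label column
features = torch.arange(40, dtype=torch.float32).reshape(20, 2)
labels = torch.zeros(20, 1)

toy_loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=False)

# Each iteration yields one (inputs, labels) pair of tensors
batch_shapes = [(xb.shape, yb.shape) for xb, yb in toy_loader]
for x_shape, y_shape in batch_shapes:
    print(tuple(x_shape), tuple(y_shape))
```

With 20 rows and a batch size of 8, the last batch is smaller than the others because the loader keeps leftover rows by default.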

In [None]:
slnn = LinearNeuron(2)
optim = torch.optim.SGD(slnn.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
mv = ModelVisualization(slnn, X_test, y_test, h=0.1)
# mv.fig.write_html('figures/figure7.html')
mv.fig

In [None]:
epochs = 50
for epoch in tqdm(range(epochs)):
    # Single epoch of training
    for bn, (inputs, labels) in enumerate(dataloader):
        # Reset gradients
        optim.zero_grad()
        # Forward
        preds = slnn(inputs)
        # Loss
        loss = loss_fn(preds, labels)
        # Backward
        loss.backward()
        # Update
        optim.step()

In [None]:
mv = ModelVisualization(slnn, X_test, y_test, h=0.025)
mv.fig.update_layout(title="Predictions After 50 Epochs")
# mv.fig.write_html('figures/figure8.html')
mv.fig

### Improving the model

After 50 epochs, the model is doing roughly what we want. But there are a few improvements we might want:

- Better fit to response surface
- Classification instead of regression

Let's tackle the fit problem first (because the classification step is easy)

### Layering/Depth

Stacking and widening neural networks.

Let's revisit our original neural network. It has two inputs and a single output. We can represent it as the following diagram:

<!-- ![Simple Neural Network](./figures/nn2-1.svg) -->

In this diagram:

- nodes are data points
- edges are model weights

Therefore this diagram represents a regression model with two X values, two weights (biases are omitted for ease of presentation), and a single output y.

But what if we create a wider network like the following?

<!-- ![Simple Neural Network](./figures/nn2-4.svg) -->

What is going on here?


In [None]:
widenn = LinearNeuron(features_in=2, features_out=4)
widenn(inputs)

What do four outputs mean?

You can think of this as training four separate regression models that take the same inputs, and output four separate values.
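
The "four separate regressions" reading can be checked numerically. The sketch below uses NumPy and made-up values (`X`, `W` and `b` here are illustrative, not the notebook's tensors): each column of the weight matrix acts as its own regression on the same inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))   # Made-up batch: 8 observations, 2 features
W = rng.normal(size=(2, 4))   # One column of weights per output
b = rng.normal(size=(1, 4))   # One bias per output

wide_out = X @ W + b          # Shape (8, 4): all four outputs at once

# Output j computed as its own stand-alone regression
separate = [X @ W[:, j] + b[0, j] for j in range(4)]

columns_match = all(np.allclose(wide_out[:, j], separate[j]) for j in range(4))
print(wide_out.shape, columns_match)
```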

How do we turn four predictions back into one?

We can use a new network!


In [None]:
widenn2 = LinearNeuron(features_in=4, features_out=1)

hidden_layer = widenn(inputs)
preds = widenn2(hidden_layer)
print(inputs.shape, hidden_layer.shape, preds.shape)

Combining the two networks, we get something that looks like this:

<!-- ![Deep Neural Network](./part1/figures/nn2-4-1.svg) -->

Can we train our wide network?

PyTorch offers a convenient method for stacking networks: `nn.Sequential`


In [None]:
sdnn = nn.Sequential(widenn, widenn2)
sdnn(inputs)

In [None]:
mvdnn = ModelVisualization(sdnn, X_test, y_test, h=0.1)
mvdnn.fig

Before we fit this model to the data, any guesses about how the prediction surface might change?


In [None]:
# We can calculate the average mistake before and after
nn.functional.mse_loss(                          # mse_loss function
    sdnn(torch.tensor(X_test).float()),          # Prediction
    torch.tensor(y_test).float().reshape(-1, 1)) # Truth

In [None]:
# 50 training epochs
optim = torch.optim.SGD(sdnn.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
epochs = 50
for epoch in tqdm(range(epochs)):
    for bn, (inputs, labels) in enumerate(dataloader):
        optim.zero_grad()
        preds = sdnn(inputs)
        loss = loss_fn(preds, labels)
        loss.backward()
        optim.step()
# Test loss
nn.functional.mse_loss(                          # mse_loss function
    sdnn(torch.tensor(X_test).float()),          # Prediction
    torch.tensor(y_test).float().reshape(-1, 1)) # Truth

But the prediction surface is still flat!


In [None]:
mvdnn = ModelVisualization(sdnn, X_test, y_test, h=0.025)
mvdnn.fig.update_layout(title="Predictions After 50 Epochs")
# mvdnn.fig.write_html('figures/figure9.html')
mvdnn.fig

It turns out that stacking linear layers, like any linear combination of linear models, still reduces to a single linear model. (A proof is beyond the scope of this workshop and also I haven't sat down to work it out).
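
For the two-layer case used here, a short algebraic sketch (not a formal proof) goes as follows. With the convention from `LinearNeuron`, the stacked model computes $(XW_1 + b_1)W_2 + b_2 = X(W_1 W_2) + (b_1 W_2 + b_2)$, which is a single linear layer with weights $W_1 W_2$ and bias $b_1 W_2 + b_2$. The NumPy check below uses made-up matrices with the 2-4-1 shapes from above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                                  # Made-up inputs
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=(1, 4))    # First layer (2 -> 4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=(1, 1))    # Second layer (4 -> 1)

stacked = (X @ W1 + b1) @ W2 + b2     # Two linear layers applied in sequence

W_eq = W1 @ W2                        # Equivalent single-layer weights, shape (2, 1)
b_eq = b1 @ W2 + b2                   # Equivalent single-layer bias, shape (1, 1)
collapsed = X @ W_eq + b_eq

print(W_eq.shape, b_eq.shape)
print(np.allclose(stacked, collapsed))
```

However many hidden units the middle layer has, the result still has just two effective weights and one bias, which is why the surface stays flat.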

### Activation Functions

The other key component to neural networks is the activation function.

You've probably already come across activation functions before--logistic regression!

Logistic regressions can be thought of as a sigmoid transformation on the output of a linear model.

Let's define a new neural network template that can take activation functions.


In [None]:
class Perceptron(nn.Module):
    def __init__(self,
                 features_in: int=1,
                 features_out: int=1,
                 activation: nn.modules.activation=nn.Sigmoid,
                 bias: bool=True):
        # This is same as before
        super(Perceptron,
              self).__init__()
        self.weights = nn.Parameter( torch.empty(features_in, features_out))
        nn.init.normal_(self.weights)
        self.bias = nn.Parameter(torch.empty(1, features_out)) if bias else 0
        if bias:
            nn.init.normal_(self.bias)
        # This is new
        self.activation = activation()
        
    
    def forward(self, inputs):
        preds = torch.matmul(inputs, self.weights) + self.bias
        output = self.activation(preds)  # Apply activation function
        return output

Aside: what _is_ the Sigmoid function?

Defined as $\mathrm{Sigmoid}(x) = \frac{1}{1+\exp(-x)}$.

We can visualize it:


In [None]:
px.line(x=torch.linspace(-5, 5, 100),
        y=torch.sigmoid(torch.linspace(-5, 5, 100)),
        labels = {'x': 'x', 'y': 'σ(x)'},
        title = 'Sigmoid function')#.write_html('figures/sigmoid.html')

The key takeaways:

- Symmetric about the point (0, 0.5): $\sigma(-x) = 1 - \sigma(x)$
- 0 maps to 0.5
- Outputs are bounded between 0 and 1
- "Saturates" as input magnitudes increase
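
These properties are easy to confirm numerically. The sketch below is a plain-Python version of the formula (the helper name `sigmoid` is introduced here for illustration).

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# Values approach 0 and 1 at the extremes and pass through 0.5 at zero
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigmoid({x:+.0f}) = {sigmoid(x):.5f}")

# Symmetry check: sigmoid(-x) = 1 - sigmoid(x)
symmetric = math.isclose(sigmoid(-1.0), 1.0 - sigmoid(1.0))
print("Symmetric about (0, 0.5):", symmetric)
```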

How do we use this for binary classification?

Let's train up a logistic classifier.

At this point we'll be doing a lot of training, so let's wrap our training in a function.


In [None]:
# Coding exercise: provide type hints for this function
def train_n_epochs(model, dataloader, epochs, optim, loss_fn, lr=1e-3, verbose=True):
    optim = optim(model.parameters(), lr=lr)
    loss_fn = loss_fn()
    for epoch in tqdm(range(epochs), disable=not verbose): # `~verbose` is truthy for both True and False
        for inputs, labels in dataloader:
            optim.zero_grad()
            preds = model(inputs)
            loss = loss_fn(preds, labels)
            loss.backward()
            optim.step()
    return model

def eval_model(model, x_test, y_test, loss_fn):
    with torch.no_grad():
        loss = loss_fn(
            model(torch.tensor(x_test).float()),  # Use the argument, not the global X_test
            torch.tensor(y_test).float().reshape(-1, 1))
    return loss

Let's instantiate, train and evaluate our model:


In [None]:
# Instantiate
model = Perceptron(2, 1, nn.Sigmoid) # 2 inputs, 1 output, Sigmoid activation
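# We use binary cross entropy for binary outcomes.
# NB: this variant expects raw logits, but the model already ends in a Sigmoid,
# so the values printed by eval_model are not a true BCE (training uses nn.BCELoss).
loss_fn = nn.functional.binary_cross_entropy_with_logits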
print(eval_model(model, X_test, y_test, loss_fn)) # Eval starting point

# Train
train_n_epochs(model, dataloader, 50, torch.optim.SGD, nn.BCELoss, lr=0.03)

# Eval
print(eval_model(model, X_test, y_test, loss_fn)) # Eval completion

# Visualize predictions
ModelVisualization(model, X_test, y_test, h=0.02).fig.update_layout(title="Simple Logistic Regression After 50 Epochs")#.write_html('figures/figure10.html')
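
One detail worth noticing in the cell above: training uses `nn.BCELoss`, which expects probabilities, while evaluation uses `binary_cross_entropy_with_logits`, which expects raw scores and applies a sigmoid itself. Because the model already ends in a Sigmoid, the evaluation numbers are on a different scale from the training loss. The plain-Python sketch below (helper names are illustrative) shows the size of the effect.

```python
import math


def bce(p, y):
    """Binary cross entropy for a predicted probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(z, y):
    """Binary cross entropy that first squashes the raw score z with a sigmoid."""
    p = 1.0 / (1.0 + math.exp(-z))
    return bce(p, y)


# A confident, correct probability for a positive example
p, y = 0.99, 1
print(f"bce(p, y)             = {bce(p, y):.4f}")
print(f"bce_with_logits(p, y) = {bce_with_logits(p, y):.4f}")

# Smallest values reachable when a probability in [0, 1] is treated as a logit
floor_pos = bce_with_logits(1.0, 1)   # best case for label 1
floor_neg = bce_with_logits(0.0, 0)   # best case for label 0
print(f"floors: {floor_pos:.4f} (label 1), {floor_neg:.4f} (label 0)")
```

With roughly balanced classes, even a perfect model therefore cannot score much below about 0.5 on this evaluation measure, which is useful context when reading any loss thresholds based on it.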

We still have a "flat" decision boundary, however. How can we improve this?

What happens if we stack two logistic models?


In [None]:
model = nn.Sequential(Perceptron(2, 2, nn.Sigmoid),
                      Perceptron(2, 1, nn.Sigmoid))
print(eval_model(model, X_test, y_test, loss_fn))
train_n_epochs(model, dataloader, 1000, torch.optim.SGD, nn.BCELoss, lr=0.03)
print(eval_model(model, X_test, y_test, loss_fn))
ModelVisualization(model, X_test, y_test, h=0.02).fig

What if we widen the middle layer?


In [None]:
model = nn.Sequential(Perceptron(2, 4, nn.Sigmoid),
                      Perceptron(4, 1, nn.Sigmoid))
print(eval_model(model, X_test, y_test, loss_fn))
train_n_epochs(model, dataloader, 1000, torch.optim.SGD, nn.BCELoss, lr=0.03)
print(eval_model(model, X_test, y_test, loss_fn))
# ModelVisualization(model, X_test, y_test, h=0.02).fig.update_layout(title="2-4-1 Logistic Network after 1000 Epochs").write_html('figures/figure11.html')
ModelVisualization(model, X_test, y_test, h=0.02).fig.update_layout(title="2-4-1 Logistic Network after 1000 Epochs")

Wider and deeper


In [None]:
model = nn.Sequential(Perceptron(2, 8, nn.Sigmoid),
                      Perceptron(8, 8, nn.Sigmoid),
                      Perceptron(8, 1, nn.Sigmoid))
print(eval_model(model, X_test, y_test, loss_fn))
train_n_epochs(model, dataloader, 1000, torch.optim.SGD, nn.BCELoss, lr=0.03)
print(eval_model(model, X_test, y_test, loss_fn))
# ModelVisualization(model, X_test, y_test, h=0.02).fig.update_layout(title="2-8-8-1 Logistic Network after 1000 Epochs").write_html('figures/figure12.html')
ModelVisualization(model, X_test, y_test, h=0.02).fig.update_layout(title="2-8-8-1 Logistic Network after 1000 Epochs")#.write_html('figures/figure12.html')

### Training to completion

You might notice that the performance is a bit sensitive with respect to the random initialization.

Suppose we split the data three ways: train, eval and test.

We could train on train, use eval to decide when training is done, and then test on test. (For brevity, the cells below reuse the test set as the stopping check.)
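
One simple way to get three splits is to shuffle the row indices once and slice them. The sketch below assumes an 80/10/10 split and a hypothetical dataset size; neither number comes from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                               # Hypothetical number of observations
idx = rng.permutation(n)               # Shuffle row indices once

n_eval, n_test = int(0.1 * n), int(0.1 * n)
test_idx = idx[:n_test]                         # Held out until the very end
eval_idx = idx[n_test:n_test + n_eval]          # Used to decide when to stop
train_idx = idx[n_test + n_eval:]               # Used for gradient updates

print(len(train_idx), len(eval_idx), len(test_idx))
```

The index arrays can then be used to slice the features and labels, for example `X[train_idx]` and `y[train_idx]`.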


In [None]:
model = nn.Sequential(Perceptron(2, 4, nn.Sigmoid),
                      Perceptron(4, 1, nn.Sigmoid))
total_epochs=0
loss = eval_model(model, X_test, y_test, loss_fn)
while loss > 0.52:
    print(loss)
    train_n_epochs(model, dataloader, 100, torch.optim.SGD, nn.BCELoss, lr=0.03, verbose=False)
    loss = eval_model(model, X_test, y_test, loss_fn)
    total_epochs += 100

In [None]:
ModelVisualization(model, X_test, y_test, h=0.02).fig.update_layout(title=f"2-4-1 Logistic Network after {total_epochs} Epochs")#.write_html('figures/figure13.html')

In [None]:
model = nn.Sequential(Perceptron(2, 4, nn.Sigmoid),
                      Perceptron(4, 4, nn.Sigmoid),
                      Perceptron(4, 4, nn.Sigmoid),
                      Perceptron(4, 1, nn.Sigmoid))
total_epochs=0
loss = eval_model(model, X_test, y_test, loss_fn)
while loss > 0.52:
    print(loss)
    train_n_epochs(model, dataloader, 100, torch.optim.SGD, nn.BCELoss, lr=0.03)
    loss = eval_model(model, X_test, y_test, loss_fn)
    total_epochs += 100

In [None]:
ModelVisualization(model, X_test, y_test, h=0.02).fig.update_layout(title=f"2-4-4-4-1 Logistic Network after {total_epochs} Epochs")#.write_html('figures/figure14.html')

On a final note, other activation functions exist:

![Activation Functions, from Kandel and Castelli 2020](https://www.researchgate.net/publication/339991922/figure/fig4/AS:870241110339586@1584493057180/Plot-of-different-activation-functions-a-Sigmoid-activation-function-b-Tanh.ppm)
<!-- ![Logistic Function](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1024px-Logistic-curve.svg.png) -->


In [None]:
model = nn.Sequential(Perceptron(2, 4, nn.ReLU),
                      Perceptron(4, 4, nn.ReLU),
                      Perceptron(4, 4, nn.ReLU),
                      Perceptron(4, 1, nn.Sigmoid))
total_epochs=0
loss = eval_model(model, X_test, y_test, loss_fn)
while loss > 0.52:
    print(loss)
    train_n_epochs(model, dataloader, 100, torch.optim.SGD, nn.BCELoss, lr=0.03, verbose=False)
    loss = eval_model(model, X_test, y_test, loss_fn)
    total_epochs += 100

In [None]:
ModelVisualization(model, X_test, y_test, h=0.02).fig.update_layout(title=f"ReLU Network after {total_epochs} Epochs")#.write_html('figures/figure15.html')