In [1]:
import os
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
import torchvision.datasets as datasets
import torchvision.transforms as transforms

from fastai.vision.all import *

In [2]:
# This patch is to display epoch results while using Visual Studio Code
from IPython.display import clear_output, DisplayHandle
def update_patch(self, obj):
    clear_output(wait=True)
    self.display(obj)
DisplayHandle.update = update_patch

# Defining Model
To define a custom model using PyTorch, you need to create a class that inherits from the `nn.Module` class provided by the PyTorch library. The `nn.Module`, which allows layers to be stacked to form a network, is the most commonly used approach for building a Neural Network in PyTorch.

## nn.Linear
The Linear layer, also known as a fully connected or dense layer, is best represented by $f(wx + b)$, where $x$ is a tensor containing the input features, $w$ and $b$ are the weight matrix and bias vector, respectively, and $f$ is the activation function. Note that `nn.Linear` itself computes only the affine part $wx + b$; the activation $f$ is applied separately. Because each layer in a NN receives its input from the previous layer, its input dimensionality is fixed by that layer's output. Typically, we only need to choose the output dimensionality when designing a NN architecture. We can initialize a Linear layer using the following syntax.

`linear_layer = nn.Linear(in_features, out_features, bias=True)`

- `in_features` specifies the number of input features or dimensions.
- `out_features` specifies the number of output features or dimensions.
- `bias` indicates whether to include a bias term in the linear transformation (default is True).

When the forward pass is called on the `nn.Linear` layer, it performs the following computations:
- **Matrix multiplication:** The input tensor of shape (`batch_size`, `in_features`) is multiplied by the transpose of the weight matrix, which has shape (`out_features`, `in_features`). The weight matrix holds the learned parameters of the linear transformation and is stored in the `nn.Linear` module as `weight`.
- **Addition of bias:** If `bias=True`, a bias vector of shape (`out_features`,) is added element-wise to every row of the matrix multiplication result.
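
As a minimal sketch of the two steps above, the layer's output can be reproduced by hand as `x @ W.T + b`. The variable names and the small layer sizes are chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small layer: 4 input features, 3 output features (illustrative sizes)
layer = nn.Linear(in_features=4, out_features=3, bias=True)
x = torch.randn(2, 4)  # batch of 2 samples

# Manual forward pass: multiply by the transposed weight, then add the bias
manual = x @ layer.weight.T + layer.bias
builtin = layer(x)

print(manual.shape)
print(torch.allclose(manual, builtin, atol=1e-6))
```

The transpose is needed because the weight is stored as (`out_features`, `in_features`), while each input row has `in_features` entries.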

In [13]:
# Initializes a Linear layer that takes 784 input features and outputs 10 features
linear = torch.nn.Linear(in_features=784, out_features=10, bias=True)
# The layer has 784 weights for each output neuron, hence the shape 10x784
print(linear.weight.shape)
# The layer has one bias per output neuron, hence 10 biases
print(linear.bias.shape)

torch.Size([10, 784])
torch.Size([10])


In [14]:
# Generates random data of size 64x784 (batch size x features)
data = torch.randn((64, 784))
# Forward pass: each of the 10 output neurons combines the 784 features with its 784 weights
outputs = linear(data)
# The output has 10 features for each of the 64 samples
print(outputs.shape)

torch.Size([64, 10])


## Weight Initialization

In PyTorch, weight initialization methods determine the initial values assigned to the weights of neural network layers. Proper weight initialization is crucial for effective training and convergence of neural networks. PyTorch provides several common weight initialization methods, each with its own characteristics. Here are explanations of some commonly used weight initialization methods in PyTorch:

- Uniform Initialization (`nn.init.uniform_`): This method initializes the weights with values drawn from a uniform distribution. It takes two arguments: `a` and `b`, representing the lower and upper bounds of the uniform distribution, respectively.
- Normal Initialization (`nn.init.normal_`): This method initializes the weights with values drawn from a normal distribution (Gaussian distribution). It takes two arguments: `mean` and `std`, representing the mean and standard deviation of the normal distribution, respectively.
- Xavier Initialization (`nn.init.xavier_uniform_` and `nn.init.xavier_normal_`): Xavier initialization, also known as Glorot initialization, is designed for activation functions that are symmetric and approximately linear around zero, such as `sigmoid` or `tanh`. `nn.init.xavier_uniform_` draws the weights from a uniform distribution on $[-a, a]$ with $a = \text{gain} \cdot \sqrt{6 / (\text{fan-in} + \text{fan-out})}$. `nn.init.xavier_normal_` draws them from a normal distribution with zero mean and standard deviation $\text{gain} \cdot \sqrt{2 / (\text{fan-in} + \text{fan-out})}$.
- He Initialization (`nn.init.kaiming_uniform_` and `nn.init.kaiming_normal_`): He initialization, also known as Kaiming initialization, is designed for the rectified linear unit (ReLU) and its variants. `nn.init.kaiming_uniform_` draws the weights from a uniform distribution on $[-a, a]$ with $a = \text{gain} \cdot \sqrt{3 / \text{fan-in}}$. `nn.init.kaiming_normal_` draws them from a normal distribution with zero mean and standard deviation $\text{gain} / \sqrt{\text{fan-in}}$. Both use the fan-in of the layer by default (`mode='fan_in'`), and the gain is $\sqrt{2}$ for ReLU.
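
The dependence on fan-in and fan-out can be checked numerically. The sketch below assumes a 784 -> 10 layer and compares the initialized weights against the Xavier uniform bound $\sqrt{6 / (\text{fan-in} + \text{fan-out})}$ and the He normal standard deviation $\sqrt{2 / \text{fan-in}}$ for ReLU; the variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 784, 10
layer = nn.Linear(fan_in, fan_out)

# Xavier/Glorot uniform: every weight lies within [-bound, bound]
nn.init.xavier_uniform_(layer.weight)
xavier_bound = math.sqrt(6.0 / (fan_in + fan_out))
max_abs_weight = layer.weight.abs().max().item()
print(max_abs_weight, xavier_bound)

# He/Kaiming normal for ReLU: the empirical std should be close to sqrt(2 / fan_in)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
kaiming_std = math.sqrt(2.0 / fan_in)
empirical_std = layer.weight.std().item()
print(empirical_std, kaiming_std)
```

The empirical standard deviation only approximates the theoretical value because it is estimated from a finite number of weights.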

In [16]:
# Re-creates the 784 -> 10 layer, then overwrites its weights in place
linear = nn.Linear(in_features=784, out_features=10, bias=True)
nn.init.uniform_(linear.weight, a=10, b=20)

Parameter containing:
tensor([[14.1827, 19.1087, 17.5157,  ..., 18.4032, 18.0875, 11.0537],
        [13.2255, 18.5650, 17.6809,  ..., 11.5792, 12.9408, 13.9958],
        [16.2591, 17.9096, 10.2327,  ..., 19.9633, 12.9072, 16.8092],
        ...,
        [19.1229, 18.3452, 18.2393,  ..., 18.0992, 11.4814, 14.1419],
        [10.8830, 14.6537, 15.9298,  ..., 10.1425, 17.9116, 17.8116],
        [10.4411, 12.4530, 10.4453,  ..., 12.6293, 12.7216, 19.6306]],
       requires_grad=True)

In [17]:
# Re-creates the 784 -> 10 layer, then overwrites its weights in place
linear = nn.Linear(in_features=784, out_features=10, bias=True)
nn.init.normal_(linear.weight, mean=5, std=2)

Parameter containing:
tensor([[5.1325, 4.3566, 5.0857,  ..., 6.6201, 1.9182, 2.5746],
        [6.5280, 7.9659, 4.0461,  ..., 3.9955, 5.2617, 4.6356],
        [5.8426, 3.2491, 5.4528,  ..., 7.0663, 7.4926, 7.4323],
        ...,
        [3.2003, 4.2170, 4.5990,  ..., 0.4067, 5.3184, 5.6994],
        [6.3685, 2.3249, 4.4892,  ..., 7.2444, 4.5465, 6.3840],
        [6.1262, 1.3187, 6.2343,  ..., 3.5328, 3.6782, 2.5086]],
       requires_grad=True)

In [18]:
# Re-creates the 784 -> 10 layer, then overwrites its weights in place
linear = nn.Linear(in_features=784, out_features=10, bias=True)
nn.init.kaiming_normal_(linear.weight)

Parameter containing:
tensor([[-0.0267, -0.0071,  0.0242,  ...,  0.0293,  0.0399,  0.0873],
        [ 0.0302,  0.0453,  0.0273,  ...,  0.0377, -0.0090,  0.0221],
        [ 0.0429,  0.0717,  0.0486,  ..., -0.0208, -0.0237,  0.0387],
        ...,
        [ 0.0317, -0.0351,  0.0853,  ...,  0.1106, -0.0072,  0.0623],
        [ 0.0445,  0.0290,  0.0084,  ..., -0.0848, -0.0619, -0.0020],
        [ 0.0518,  0.0298, -0.1020,  ...,  0.0755, -0.0836,  0.0803]],
       requires_grad=True)

In [19]:
# Re-creates the 784 -> 10 layer, then overwrites its weights in place
linear = nn.Linear(in_features=784, out_features=10, bias=True)
nn.init.xavier_uniform_(linear.weight)

Parameter containing:
tensor([[ 0.0646, -0.0578,  0.0169,  ..., -0.0700, -0.0438,  0.0082],
        [ 0.0549,  0.0047, -0.0811,  ...,  0.0694, -0.0202,  0.0320],
        [ 0.0073, -0.0778, -0.0712,  ...,  0.0288, -0.0054,  0.0132],
        ...,
        [-0.0806,  0.0283,  0.0164,  ..., -0.0418,  0.0827, -0.0269],
        [-0.0093, -0.0573,  0.0183,  ...,  0.0729, -0.0847,  0.0775],
        [ 0.0566,  0.0737, -0.0402,  ..., -0.0369, -0.0506, -0.0850]],
       requires_grad=True)

## Activation Functions
PyTorch provides various activation functions that introduce non-linearity in neural networks. Here are explanations of some commonly used activation functions in PyTorch, along with code examples and their advantages and disadvantages.

- **ReLU (Rectified Linear Unit)** (`torch.relu`):
  - Computational efficiency: ReLU is cheap to compute compared with activation functions that involve exponentials.
  - Simplicity: ReLU is simple and easy to implement.
  - Dead neurons: ReLU can lead to dead neurons, which output zero for every input and therefore stop contributing to learning.
  - Saturation at zero: For negative inputs both the output and the gradient are exactly zero, so a large number of neurons can become inactive during training.
- **Sigmoid** (`torch.sigmoid`):
  - Output range: Sigmoid function squashes the output between 0 and 1, which can be useful for binary classification tasks.
  - Smooth gradient: Sigmoid has a smooth derivative, making it well-suited for gradient-based optimization algorithms.
  - Vanishing gradient: Sigmoid is prone to the vanishing gradient problem, where gradients become very small during backpropagation, leading to slower learning.
  - Output saturation: Sigmoid can suffer from output saturation, where the output values tend to get close to the extremes (0 or 1), resulting in limited learning.
- **Tanh** (`torch.tanh`):
  - Output range: Tanh function squashes the output between -1 and 1, which can be useful for capturing negative values.
  - Zero-centered: Tanh is zero-centered, which can help in convergence during optimization.
  - Vanishing gradient: Tanh is also prone to the vanishing gradient problem.
  - Output saturation: Tanh can suffer from output saturation similar to the sigmoid function.
- **LeakyReLU** (`torch.nn.LeakyReLU`):
  - Avoids dead neurons: LeakyReLU helps to avoid the problem of dead neurons by allowing small negative values.
  - Linear region: LeakyReLU has a linear region with a small non-zero slope for negative inputs, so gradients still flow and learning continues even when the input is negative.
  - Hyperparameter choice: The choice of the negative slope parameter requires tuning.
- **ELU** (`torch.nn.ELU`):
  - ELU produces negative outputs for x < 0, allowing the activation to carry information about negative inputs.
  - ELU helps mitigate the vanishing gradient problem by providing non-zero gradients for negative inputs, which can aid in better optimization.
  - ELU can provide better learning representations compared to traditional activation functions.
  - ELU introduces a slight computational overhead compared to other activation functions due to the exponential operation.
  - ELU may be sensitive to the choice of the hyperparameter alpha, which needs to be carefully tuned.
- **SELU (Scaled Exponential Linear Unit)** (`nn.SELU`):
  - SELU is a self-normalizing activation function: it is designed so that the mean and standard deviation of the activations remain stable as the network deepens, which can aid in training deep neural networks.
  - SELU can reduce the need for other regularization techniques, such as dropout or batch normalization.
  - SELU can provide improved performance compared to other activation functions in certain scenarios.
  - SELU requires careful initialization and scaling of weights to ensure the self-normalizing properties.
  - SELU is not recommended for networks with non-sequential architectures or networks that utilize other normalization techniques.
- **Softmax** (`torch.softmax`):
  - Probability distribution: Softmax transforms the input values into a probability distribution, which is useful for multi-class classification tasks.
  - Interpretability: Softmax outputs can be interpreted as class probabilities.
  - Sensitivity to outliers: Softmax is sensitive to extreme input values; a single input that is much larger than the rest pushes its probability close to 1 and all the others close to 0.
  - Lack of sparsity: Softmax tends to produce dense output vectors with non-zero probabilities for most classes.
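
To make the output ranges listed above concrete, the following sketch applies each activation to the same small tensor. The input values are arbitrary and chosen only to cover negative, zero, and positive cases.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = torch.relu(x)                           # negatives are clamped to 0
sigmoid_out = torch.sigmoid(x)                     # values in (0, 1)
tanh_out = torch.tanh(x)                           # values in (-1, 1), zero-centered
leaky_out = nn.LeakyReLU(negative_slope=0.01)(x)   # negatives are scaled by 0.01
elu_out = nn.ELU(alpha=1.0)(x)                     # negatives follow alpha * (exp(x) - 1)
selu_out = nn.SELU()(x)                            # scaled ELU with fixed constants
probs = torch.softmax(x, dim=0)                    # non-negative values that sum to 1

for name, out in [('relu', relu_out), ('sigmoid', sigmoid_out), ('tanh', tanh_out),
                  ('leaky_relu', leaky_out), ('elu', elu_out), ('selu', selu_out),
                  ('softmax', probs)]:
    print(name, out)
```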

In [20]:
leakyRelu = nn.LeakyReLU(negative_slope=0.01)
leakyRelu(outputs)

tensor([[ 1.7961e-01,  3.3568e-01, -8.6302e-04,  1.8407e-03, -4.3542e-03,
          2.1144e-01,  1.8037e-01,  4.3073e-01,  4.2648e-01, -1.3548e-02],
        [ 1.7430e-01, -2.7524e-04, -1.0590e-02,  8.0253e-01,  1.2953e-01,
          5.6662e-01, -1.4731e-04, -9.5626e-03,  2.3697e-01,  9.9528e-01],
        [ 5.7706e-02,  6.1784e-01,  4.6256e-01, -4.7602e-03, -1.1160e-03,
         -7.2501e-03, -7.3342e-03,  8.5763e-02,  3.9352e-01,  9.5922e-01],
        [-4.4337e-03, -5.5241e-03, -1.1487e-02,  2.2593e-01,  3.0301e-03,
          2.5830e-03, -2.7371e-04,  8.7060e-01,  5.5905e-01,  3.4141e-01],
        [ 4.4911e-01, -1.1538e-03, -7.9182e-04, -3.8672e-03, -9.7753e-04,
          3.4488e-01, -1.4066e-03,  1.4861e+00,  1.6382e-01,  4.7978e-01],
        [-2.0241e-03, -3.3248e-03, -4.5813e-03, -3.4967e-03,  6.3147e-01,
         -2.6534e-03, -5.7521e-03, -4.2019e-03, -2.7875e-03, -2.7169e-03],
        [-4.5183e-04, -5.6864e-03,  1.7076e-01, -4.2472e-03, -6.1661e-03,
         -8.0808e-03,  3.0534e-0

## Regularizations

In PyTorch, you can implement regularization techniques to prevent overfitting and improve the generalization of your neural network models. Here are a few commonly used regularization techniques and their implementations in PyTorch:

- **L2 Regularization (Weight Decay)**: L2 regularization, also known as weight decay, adds a penalty term to the loss function that encourages smaller weight values. PyTorch provides a built-in option to apply L2 regularization while defining the optimizer. You can specify the `weight_decay` parameter when creating an optimizer.
- **L1 regularization**: It encourages sparsity in the weights by adding a penalty term that promotes small absolute values. PyTorch does not have built-in support for L1 regularization in optimizers, but you can manually add the L1 penalty term to the loss function and optimize it.
- **Dropout**: Dropout randomly sets a fraction of the input units to zero during training, which helps to prevent overfitting and encourages the network to learn more robust features. PyTorch provides a `nn.Dropout` module that can be added to your model's architecture.
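
The three techniques above can be sketched together on a toy model. The model, the random data, and the penalty strengths (`weight_decay=1e-4`, `l1_lambda=1e-3`) are illustrative assumptions, not tuned values.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

toy_model = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2))
toy_criterion = nn.CrossEntropyLoss()

# L2 regularization: handled by the optimizer through weight_decay
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.1, weight_decay=1e-4)

x = torch.randn(16, 20)
y = torch.randint(0, 2, (16,))

# L1 regularization: add the sum of absolute parameter values to the loss by hand
l1_lambda = 1e-3
base_loss = toy_criterion(toy_model(x), y)
l1_penalty = sum(p.abs().sum() for p in toy_model.parameters())
loss = base_loss + l1_lambda * l1_penalty

toy_optimizer.zero_grad()
loss.backward()
toy_optimizer.step()

# Dropout is only active in training mode; in eval mode it passes inputs through unchanged
drop = nn.Dropout(p=0.5)
drop.eval()
ones = torch.ones(10)
print(torch.equal(drop(ones), ones))
```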

## Building Custom Model
In this case, we'd like to define a model with two hidden layers. The first takes $784$ features as input and projects them to $25$ neurons, and the second projects those $25$ activations to $15$ neurons. Because we have $10$ class labels, the output layer receives the output of the second hidden layer (which has a size of $15$) and projects it to $10$ output neurons.

In [7]:
# a simple classifier
class NN(nn.Module):
    def __init__(self, in_features, num_classes):
        super(NN, self).__init__()
        self.dropout = nn.Dropout(p=0.20)
        self.fc1 = nn.Linear(in_features, 25)
        nn.init.kaiming_normal_(self.fc1.weight)
        self.fc2 = nn.Linear(25, 15)
        nn.init.kaiming_normal_(self.fc2.weight)
        self.output = nn.Linear(15, num_classes)
        nn.init.kaiming_normal_(self.output.weight)
        
    def forward(self, x):
        x = self.dropout(x)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.output(x)
        return x
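
As a quick sanity check on the layer sizes of the classifier above, the sketch below rebuilds the same three Linear layers with `nn.Sequential` (dropout omitted for brevity, which is an assumption of this example) and counts the parameters.

```python
import torch
import torch.nn as nn

# Same layer sizes as the classifier above, written with nn.Sequential
net = nn.Sequential(
    nn.Linear(784, 25), nn.ReLU(),
    nn.Linear(25, 15), nn.ReLU(),
    nn.Linear(15, 10),
)

# Each Linear layer holds in_features * out_features weights plus out_features biases
expected = (784 * 25 + 25) + (25 * 15 + 15) + (15 * 10 + 10)
actual = sum(p.numel() for p in net.parameters())
print(expected, actual)

# A batch of 64 flattened 28x28 images yields one score per class
toy_scores = net(torch.randn(64, 784))
print(toy_scores.shape)
```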

To accelerate operations in the neural network, we move it to the GPU if available.

In [8]:
# sets device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
# initializes and sends the model to appropriate GPU/CPU
model = NN(784, 10).to(device)

cuda


# Data Loader & Preprocessing
PyTorch offers domain-specific libraries such as TorchText, TorchVision, and TorchAudio, all of which include datasets. These datasets are subclasses of `torch.utils.data.Dataset` and can therefore be passed to the `torch.utils.data.DataLoader` class. Find more about torchvision datasets [here](https://pytorch.org/vision/0.8/datasets.html).
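
Any class that subclasses `Dataset` and implements `__len__` and `__getitem__` can be batched by a `DataLoader` in the same way. A minimal sketch, using a hypothetical `RandomDigits` class with random tensors standing in for real images:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomDigits(Dataset):
    """Hypothetical toy dataset: random 1x28x28 'images' with labels 0-9."""

    def __init__(self, n_samples=100):
        generator = torch.Generator().manual_seed(0)
        self.images = torch.rand(n_samples, 1, 28, 28, generator=generator)
        self.labels = torch.randint(0, 10, (n_samples,), generator=generator)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

toy_loader = DataLoader(RandomDigits(), batch_size=16, shuffle=True)
toy_x, toy_y = next(iter(toy_loader))
print(toy_x.shape, toy_y.shape)
print(len(toy_loader))  # number of batches, including the smaller last one
```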

In [43]:
batch_size = 16

# load data
train_data = datasets.MNIST(
    root='./res/datasets',
    train=True,
    transform=transforms.ToTensor(),
    download=True
)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

test_data = datasets.MNIST(
    root='./res/datasets',
    train=False,
    transform=transforms.ToTensor(),
    download=True
)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

In [44]:
# Checks dimensions of each minibatch
x, y = next(iter(train_loader))
print(x.size())
print(y.size())

torch.Size([16, 1, 28, 28])
torch.Size([16])


In [45]:
# Checks how the target labels are encoded
y[0]

tensor(0)

# Hyperparameters & Optimizers

In [46]:
# sets hyperparameters
in_features = 784
num_classes = 10
learning_rate = 0.01
num_epochs = 5

To train a model, we need a loss function and an optimizer.

## Loss Functions
PyTorch provides a wide range of loss functions that cater to various machine learning tasks. Here's an explanation of some commonly used loss functions in PyTorch:

- **Mean Squared Error (MSE) Loss**: The MSE loss calculates the average squared difference between the predicted and target values. It is commonly used for regression tasks.
  - `loss_fn = nn.MSELoss()`
- **Binary Cross Entropy Loss**: Binary cross entropy loss measures the dissimilarity between binary predictions and binary targets. It is commonly used for binary classification tasks. The input to `torch.nn.BCELoss` is expected to be a tensor of sigmoid outputs between 0 and 1. The target values should be in the range [0, 1] representing the ground truth probabilities.
  - `loss_fn = nn.BCELoss()`
- **Binary Cross Entropy Loss with Logits**: Binary cross-entropy with logits loss is used for binary classification problems but takes logits as inputs instead of sigmoid outputs. This loss combines a sigmoid activation and binary cross-entropy loss in a single efficient computation. The input to `torch.nn.BCEWithLogitsLoss` is expected to be logits (real-valued scores) without applying sigmoid activation.
  - `loss_fn = nn.BCEWithLogitsLoss()`
- **Cross Entropy Loss**: Cross entropy loss is used for multi-class classification tasks. It computes the negative log likelihood of the true class under the predicted class probabilities. The input to `torch.nn.CrossEntropyLoss` is expected to be logits (real-valued scores) without applying softmax activation. The target values should be class indices (integer labels) representing the ground truth class.
  - `loss_fn = nn.CrossEntropyLoss()`
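- **Huber Loss**: Huber loss combines the best attributes of mean absolute error (MAE) and mean squared error (MSE) loss: it is quadratic for small errors and linear for large ones. It is therefore less sensitive to outliers than MSE while remaining smooth near zero, unlike MAE.
  - `loss_fn = nn.SmoothL1Loss()` (with the default `beta=1.0`, this is equivalent to `nn.HuberLoss()` with its default `delta=1.0`)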
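
The input conventions described above (logits versus probabilities) can be verified with a short sketch. The random logits and targets are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Multi-class: CrossEntropyLoss takes raw logits and integer class indices
logits = torch.randn(4, 10)
class_targets = torch.tensor([1, 0, 7, 3])
ce = nn.CrossEntropyLoss()(logits, class_targets)
# Equivalent to log-softmax followed by the negative log likelihood loss
ce_manual = F.nll_loss(F.log_softmax(logits, dim=1), class_targets)

# Binary: BCEWithLogitsLoss takes raw logits, BCELoss takes sigmoid outputs
bin_logits = torch.randn(6)
bin_targets = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
bce_from_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce_from_probs = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)

print(torch.allclose(ce, ce_manual, atol=1e-6))
print(torch.allclose(bce_from_logits, bce_from_probs, atol=1e-6))
```

Applying softmax or sigmoid before a loss that already expects logits would apply the activation twice, which is a common source of silently poor training.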

In [47]:
criterion = nn.CrossEntropyLoss()

## Optimizers
In PyTorch, there are several optimization algorithms (optimizers) available for training neural networks. Each optimizer has its own characteristics and can be suitable for different scenarios. Here are explanations of different optimizers in PyTorch:

- **Stochastic Gradient Descent (SGD)**: SGD is a widely used optimization algorithm that updates the model parameters based on the gradients of the loss function with respect to those parameters.
- **RMSprop (Root Mean Square Propagation)**: RMSprop is an adaptive optimization algorithm that divides the learning rate by the square root of a running average of squared gradients. It helps to stabilize the learning process and adapt the learning rate to different parameters.
- **Adagrad (Adaptive Gradient Algorithm)**: Adagrad is an adaptive optimization algorithm that adapts the learning rate for each parameter based on the historical sum of squared gradients. It gives larger updates to infrequently updated parameters and smaller updates to frequently updated ones.
- **Adam (Adaptive Moment Estimation)**: Adam is an adaptive optimization algorithm that combines ideas from both AdaGrad and RMSProp. It adapts the learning rate for each parameter based on estimates of the first and second moments of the gradients.
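
All four optimizers share the same interface (`zero_grad`, `backward`, `step`), so they can be swapped freely. The sketch below fits a toy linear regression with each one; the data, the learning rates, and the number of steps are illustrative assumptions rather than recommended settings.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
x = torch.randn(32, 5)
y = torch.randn(32, 1)

# One factory per optimizer so that each run gets a fresh optimizer state
optimizer_factories = {
    'SGD': lambda params: optim.SGD(params, lr=0.01, momentum=0.9),
    'RMSprop': lambda params: optim.RMSprop(params, lr=0.01),
    'Adagrad': lambda params: optim.Adagrad(params, lr=0.01),
    'Adam': lambda params: optim.Adam(params, lr=0.01),
}

losses = {}
for name, make_optimizer in optimizer_factories.items():
    torch.manual_seed(1)  # same initial weights for every optimizer
    toy_model = nn.Linear(5, 1)
    toy_optimizer = make_optimizer(toy_model.parameters())
    initial_loss = nn.functional.mse_loss(toy_model(x), y).item()
    for _ in range(100):
        toy_optimizer.zero_grad()
        toy_loss = nn.functional.mse_loss(toy_model(x), y)
        toy_loss.backward()
        toy_optimizer.step()
    losses[name] = (initial_loss, toy_loss.item())
    print(name, losses[name])
```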

In [48]:
optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=0.01)

In [49]:
loss_per_epoch = [0] * num_epochs
acc = [0] * num_epochs

# Training

In [50]:
for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(train_loader):
        # Moves the data to the GPU/CPU
        data = data.to(device)
        # Converts the 2D image into a 1-D vector
        data = data.reshape(data.shape[0], -1)
        # Moves the target labels to GPU/CPU
        targets = targets.to(device)

        # Make predictions with the current parameters
        scores = model(data)
        # Calculates loss of the current minibatch
        loss = criterion(scores, targets)

        # Resets the gradients to zero
        optimizer.zero_grad()
        # Compute gradients of loss function with respect to parameters
        loss.backward()
        # Updates parameters
        optimizer.step()

        loss_per_epoch[epoch] += loss.item() * data.size(0)
        correct = (torch.argmax(scores, dim=1) == targets).float()
        acc[epoch] += correct.mean()

    loss_per_epoch[epoch] /= len(train_loader.dataset)
    acc[epoch] /= (len(train_loader.dataset) / batch_size)
    print('Epoch {}| Loss {} | Accuracy {}'.format(epoch, loss_per_epoch[epoch], acc[epoch]))


Epoch 0| Loss 0.9416069771726926 | Accuracy 0.6893333196640015
Epoch 1| Loss 0.8866419506152471 | Accuracy 0.7128666639328003
Epoch 2| Loss 0.8943693260550499 | Accuracy 0.7093166708946228
Epoch 3| Loss 0.8825760385394097 | Accuracy 0.7140499949455261
Epoch 4| Loss 0.8927643306493759 | Accuracy 0.7110999822616577
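
Averaging per-batch accuracies, as in the loop above, is exact only when every batch has the same size (the 60,000 MNIST training images divide evenly into batches of 16). A more general sketch counts correct predictions and divides by the number of samples; the random scores and labels below are stand-ins for real model outputs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 50 samples, so a batch size of 16 leaves a final batch of 2
fake_scores = torch.randn(50, 10)
fake_labels = torch.randint(0, 10, (50,))
fake_loader = DataLoader(TensorDataset(fake_scores, fake_labels), batch_size=16)

n_correct, n_total = 0, 0
for batch_scores, batch_labels in fake_loader:
    predictions = batch_scores.argmax(dim=1)
    n_correct += (predictions == batch_labels).sum().item()  # plain Python int
    n_total += batch_labels.size(0)

epoch_accuracy = n_correct / n_total
print(epoch_accuracy)
```

Calling `.item()` also keeps the running totals as Python numbers rather than tensors that live on the GPU.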


# Saving & Loading Models
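Trained models can be saved to disk and reused in the future. When you call `torch.save(model, path)`, **you are saving both the model architecture and all of the learned parameters**. Because the whole object is pickled, the class definition (`NN` here) must still be available when the file is loaded. As a standard practice, we save models with the `.pt` or `.pth` file extension.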

In [51]:
if 'models' not in os.listdir():
    os.mkdir('models')
    print('models directory created!')
else:
    print('models directory already exists!')

models directory already exists!


In [52]:
torch.save(model, 'models/ann.pth')

In [53]:
model = torch.load('models/ann.pth')
model.eval()

NN(
  (dropout): Dropout(p=0.2, inplace=False)
  (fc1): Linear(in_features=784, out_features=25, bias=True)
  (fc2): Linear(in_features=25, out_features=15, bias=True)
  (output): Linear(in_features=15, out_features=10, bias=True)
)

Alternatively, you can save just the parameters (the `state_dict`) rather than the whole model object. To load them, first re-create the model and then call `load_state_dict`.

In [54]:
torch.save(model.state_dict(), 'models/ann_state.pth')

In [55]:
model = NN(784, 10)
model.load_state_dict(torch.load('models/ann_state.pth'))
model = model.to(device)
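
The round trip can be checked on a toy layer. This sketch writes to a temporary directory instead of `models/`, and passes `map_location='cpu'` so that a checkpoint saved on a GPU also loads on a CPU-only machine; the names `source`, `restored`, and `ckpt_path` are illustrative.

```python
import os
import tempfile
import torch
import torch.nn as nn

torch.manual_seed(0)
source = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, 'toy_state.pth')
    # Save only the parameters
    torch.save(source.state_dict(), ckpt_path)
    # Rebuild the architecture first, then load the parameters into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(ckpt_path, map_location='cpu'))

restored.eval()  # puts layers such as dropout into inference mode
sample = torch.randn(3, 4)
print(torch.equal(source(sample), restored(sample)))
```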

# Model Evaluation

In [56]:
def check_accuracy(loader, model):
    correct = 0
    total = 0
    model.eval()

    with torch.no_grad():
        for x, y in loader:
            x = x.reshape(x.shape[0], -1)
            x = x.to(device)
            y = y.to(device)

            scores = model(x)
            _, predictions = scores.max(1)
            correct += (predictions == y).sum()
            total += predictions.size(0)
        model.train()
        print('Accuracy: ', correct/total)

In [57]:
check_accuracy(test_loader, model)

Accuracy:  tensor(0.8491, device='cuda:0')


# Fast.AI
The `MNIST SAMPLE` dataset from fastai is a smaller version of the actual `MNIST Digits` dataset, and it contains images of only $3$ and $7$.

In [9]:
path = untar_data(URLs.MNIST_SAMPLE)
for dir in path.ls():
    print(dir)

C:\Users\musab\.fastai\data\mnist_sample\labels.csv
C:\Users\musab\.fastai\data\mnist_sample\train
C:\Users\musab\.fastai\data\mnist_sample\valid


In [10]:
print((path/'train').ls())
print((path/'valid').ls())

[Path('C:/Users/musab/.fastai/data/mnist_sample/train/3'), Path('C:/Users/musab/.fastai/data/mnist_sample/train/7')]
[Path('C:/Users/musab/.fastai/data/mnist_sample/valid/3'), Path('C:/Users/musab/.fastai/data/mnist_sample/valid/7')]


In [11]:
train_x = torch.cat((
    torch.stack([tensor(Image.open(f)).reshape(-1) for f in (path/'train'/'3').ls().sorted()]) / 255.0,
    torch.stack([tensor(Image.open(f)).reshape(-1) for f in (path/'train'/'7').ls().sorted()]) / 255.0
), dim=0)
print('Training Dataset: ', train_x.shape)

valid_x = torch.cat((
    torch.stack([tensor(Image.open(f)).reshape(-1) for f in (path/'valid'/'3').ls().sorted()]) / 255.0,
    torch.stack([tensor(Image.open(f)).reshape(-1) for f in (path/'valid'/'7').ls().sorted()]) / 255.0
), dim=0)
print('Validation Dataset: ', valid_x.shape)

Training Dataset:  torch.Size([12396, 784])
Validation Dataset:  torch.Size([2038, 784])


In [12]:
train_y = tensor([3] * len((path/'train'/'3').ls()) + [7] * len((path/'train'/'7').ls()))
print('Train Dataset: ', train_y.shape)
valid_y = tensor([3] * len((path/'valid'/'3').ls()) + [7] * len((path/'valid'/'7').ls()))
print('Validation Dataset: ', valid_y.shape)

Train Dataset:  torch.Size([12396])
Validation Dataset:  torch.Size([2038])


In [14]:
train_loader = DataLoader(L(zip(train_x, train_y)), batch_size=32, shuffle=True)
valid_loader = DataLoader(L(zip(valid_x, valid_y)), batch_size=32, shuffle=True)
dls = DataLoaders(train_loader, valid_loader)

## Learner
The Learner class in the fastai library provides a high-level interface for training models. It encapsulates the model, data, optimizer, loss function, metrics, and other essential components. Here are explanations of some important parameters of the Learner class in fastai:

- dls (DataLoaders): This parameter represents the data loaders used for training and validation. It is an instance of the `DataLoaders` class, which wraps the training and validation data loaders along with other related settings.
- model (nn.Module): The model parameter refers to the neural network model to be trained. It should be an instance of a PyTorch `nn.Module` subclass.
- loss_func (callable): This parameter represents the loss function used during training. It should be a callable that takes the model predictions and target labels as inputs and returns a scalar value representing the loss.
- opt_func (callable): The opt_func parameter specifies the optimizer function used for parameter updates during training. It should be a callable that takes the model parameters as input and returns an optimizer object.
- metrics (list): The metrics parameter allows you to specify a list of evaluation metrics to be computed during training and validation. Each metric should be a callable that takes the model predictions and target labels as inputs and returns a value representing the metric score.
- cbs (list): This parameter stands for callbacks, which are objects or functions that can be used to customize the behavior of the training process. You can pass a list of callback objects or functions to the cbs parameter.
- splitter (callable): The splitter parameter defines how the model's parameters are split into parameter groups, for example to freeze some layers or to train groups with different learning rates. It should be a callable that takes the model and returns its parameter groups; the default, `trainable_params`, returns all trainable parameters of the model.
- wd (float or None): The wd parameter represents weight decay, which is a regularization technique that applies a penalty to the weights. If wd is a float value, weight decay is applied during training.
- callbacks (list): This was the name of the callback list in fastai v1. In fastai v2, which is imported here, callbacks are passed through `cbs` instead.
- learn (Learner): This is not an argument of `Learner` itself. Each callback keeps a reference to its parent Learner in its `learn` attribute, which it can use to access the model, the data, and other aspects of the training process.

In [16]:
learner = Learner(dls,
                  NN(784, 10).to(device),
                  opt_func=Adam,
                  loss_func=nn.functional.cross_entropy,
                  metrics=[accuracy, error_rate, top_k_accuracy],
                  wd=0.1)
learner.fit(5, lr=0.01)

epoch,train_loss,valid_loss,accuracy,error_rate,top_k_accuracy,time
0,0.057971,0.049417,0.98528,0.01472,1.0,00:02
1,0.052985,0.039178,0.990677,0.009323,1.0,00:03
2,0.032094,0.041715,0.989205,0.010795,1.0,00:03
3,0.031951,0.036163,0.988714,0.011286,1.0,00:02
4,0.030255,0.043244,0.990186,0.009814,1.0,00:03
