---
title: 'under the hood: training a digit classifier'
description: 'fastai book chapter 4'
date: "2023-10-07"
date-format: iso
image: under_the_hood_training_a_digit_classifier_thumbnail.jpg
categories: [fastai, deeplearning, self-study]
toc: true
draft: true
title-block-banner: false
---

![fastai book chapter 4 - jantxt - 2023](under_the_hood_training_a_digit_classifier_thumbnail.jpg){fig-align="left" width=40%}

# Intro

This is my summary of chapter 4 of the book "Deep Learning for Coders with fastai & PyTorch".

 - questions - questions about the chapter
 - key concepts - summary of the chapter's key concepts

::: {.callout-note}
## Links 

- Homepage: [fastai homepage](https://www.fast.ai/)
- Online Book: [fastai online book](https://course.fast.ai/Resources/book.html)
- Author: [Jeremy Howard](https://jeremy.fast.ai/)
- Author: [Sylvain Gugger](https://sgugger.github.io/)
:::

# Questions

Questions about the chapter. 

[questions - chapter 4](subsite/under_the_hood_training_a_digit_classifier_questions.ipynb) 

# Code

x

# Key Concepts

Summarized key concepts of this chapter.

## NumPy Arrays and PyTorch Tensors

NumPy is the most widely used library for scientific and numeric programming in Python. A NumPy array is a multidimensional table of data, with all items of the same type. Because that type can be anything at all, the items can even be arrays themselves (arrays of arrays).

By number of dimensions, these tables are named as follows:

- A 1-dimensional tensor/array is called a vector.
```{python}
[1,2,3]
```

- A 2-dimensional tensor/array is called a matrix.
```{python}
[[1, 2],
 [3, 4]]
```

- A tensor/array with 3 or more dimensions is simply called a tensor.
```{python}
[[[1, 2],
  [3, 4]],
 [[5, 6],
  [7, 8]]]
```
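
The nested lists above are plain Python lists. As a minimal sketch, assuming NumPy is installed, the same data can be wrapped in arrays to inspect the number of dimensions and the shape. The variable names are only illustrative; `torch.tensor(...)` objects expose the same `ndim` and `shape` attributes.

```python
import numpy as np

# The same three examples as NumPy arrays
vector = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])
tensor3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# ndim is the number of dimensions, shape is the size along each dimension
for name, arr in [("vector", vector), ("matrix", matrix), ("tensor3d", tensor3d)]:
    print(name, "ndim:", arr.ndim, "shape:", arr.shape)
```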

Python is slow compared to many other languages. NumPy and PyTorch wrap functions written in a compiled language (specifically C) to be much faster, so NumPy arrays and PyTorch tensors can finish computations many thousands of times faster than pure Python.
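
A rough way to see the difference is to time the same computation both ways. The sketch below is an illustration, not a benchmark: the array size and the sum-of-squares task are arbitrary choices, and the measured speedup depends on the machine and the operation.

```python
import time
import numpy as np

n = 1_000_000
values = list(range(n))
array = np.arange(n, dtype=np.int64)

# Pure Python: one interpreted loop iteration per element
start = time.perf_counter()
python_total = 0
for v in values:
    python_total += v * v
python_seconds = time.perf_counter() - start

# NumPy: the loop runs inside compiled code
start = time.perf_counter()
numpy_total = int((array * array).sum())
numpy_seconds = time.perf_counter() - start

print(f"pure Python: {python_seconds:.4f} s, NumPy: {numpy_seconds:.4f} s")
```

Both versions compute the same total; only the time taken differs.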

PyTorch vs NumPy:

- PyTorch tensors are similar to NumPy arrays, but they can also be stored and operated on a GPU. NumPy arrays are mainly used in classical machine learning algorithms to support fast mathematical operations, whereas PyTorch tensors are mainly used in deep learning, which requires heavy matrix computations on the GPU. In addition, PyTorch can automatically calculate derivatives of these computations.
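
A small sketch of both points, assuming PyTorch is installed: the tensor is placed on the GPU when one is available (otherwise on the CPU), and the derivative of a simple function is computed automatically. The input values are made up for illustration.

```python
import torch

# Use the GPU if PyTorch can see one, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# requires_grad=True asks PyTorch to track operations on x
x = torch.tensor([1.0, 2.0, 3.0], device=device, requires_grad=True)

# y = x1^2 + x2^2 + x3^2, so dy/dx = 2 * x
y = (x ** 2).sum()
y.backward()

print(x.grad)  # gradient values: 2., 4., 6.
```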

::: {.callout-note}
One of the most important skills when working with NumPy arrays and PyTorch tensors is learning to use the array/tensor APIs effectively, so that processing and computations run through the optimized, fast functions they provide.
:::

## Broadcasting

Broadcasting is a way to perform an operation between tensors whose shapes are compatible but not identical. This is an important operation in deep learning. The common example is multiplying a tensor of learning weights by a batch of input tensors: the operation is applied to each instance in the batch separately, and the result has the same shape as the batch.

```{python}
import torch

# Broadcasting Example 1: Vector + Scalar
a = torch.tensor([1, 2, 3])      # Shape: (3,)
b = 2                            # Python scalar, treated as shape ()
print(a + b)                     # tensor([3, 4, 5]), shape (3,)

# Broadcasting Example 2: Matrix + Vector
a = torch.tensor([[1, 2, 3],     # Shape: (2, 3)
                  [4, 5, 6]])
b = torch.tensor([10, 20, 30])   # Shape: (3,)
print(a + b)                     # Shape: (2, 3)
# tensor([[11, 22, 33],
#         [14, 25, 36]])
```

The rules of broadcasting:

- Each tensor must have at least one dimension (no empty tensors).
- Comparing the dimension sizes of the two tensors, going from last to first, one of the following must hold for each pair:
    - the dimension sizes are equal, or
    - one of the dimension sizes is 1, or
    - the dimension does not exist in one of the tensors.
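
The dimension-comparison rules can be sketched in pure Python. The helper name `broadcast_shape` is hypothetical (it is not a NumPy or PyTorch function), and it only checks shapes; it does not perform any arithmetic.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the last dimension to the first;
    # a dimension missing in one shape is treated like size 1.
    for dim_a, dim_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(reversed(result))

print(broadcast_shape((2, 3), (3,)))       # (2, 3)
print(broadcast_shape((3,), ()))           # (3,)
print(broadcast_shape((4, 1, 3), (5, 1)))  # (4, 5, 3)
```

Shapes such as `(2, 3)` and `(2,)` fail the check, because the last dimensions (3 and 2) differ and neither is 1.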

## Stochastic Gradient Descent (SGD)

### What is SGD?
SGD is an optimization algorithm that updates model parameters to minimize the loss function. Think of it as helping the model learn from its mistakes by taking small steps in the right direction.

### How it Works
1. **Init**
   - Random weights (training from scratch)
   - Or pretrained weights (transfer learning)

2. **Predict**
   - Model makes predictions with current weights
   - Predictions are poor at first and improve as training proceeds

3. **Loss**
   - Compare predictions with actual targets
   - Loss function measures how wrong predictions are
   - Key difference from regular gradient descent: the loss is computed on a random subset of the data (a mini-batch) instead of the full dataset

4. **Gradient**
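   - Calculate the gradient of the loss with respect to each weight; stepping against it gives the steepest improvement
   - Mountain analogy: to reach the valley, keep stepping in the steepest downhill direction from where you stand
     ```python
     # Example: gradient of the squared error (prediction - target)**2
     # with respect to the prediction (values are illustrative)
     prediction, target = 0.8, 1.0
     loss_gradient = 2 * (prediction - target)
     ```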

5. **Step**
   - Update weights using learning rate (η)
   - Formula: `new_weight = old_weight - learning_rate * gradient`
   ```python
   # Example: weight update (values are illustrative)
   weights, gradient, learning_rate = 0.5, -0.4, 0.1
   weights = weights - learning_rate * gradient
   ```

## Gradient Descent vs Stochastic Gradient Descent

### 1. Regular Gradient Descent
- Uses ALL training data for each update

### 2. Stochastic Gradient Descent
- Uses SMALL RANDOM BATCHES of data

### Simple Comparison
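Regular GD:

  - One update per pass over the full dataset
  - Exact gradient of the training loss
  - Slow to compute per update
  - More memory needed

Stochastic GD:

  - Many updates per pass, one per mini-batch
  - Noisier gradient estimate per step
  - Fast to compute per update
  - Less memory needed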

Regular GD: Reading an entire book before making notes

SGD: Making notes after each chapter and adjusting as you go
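
To make the comparison concrete, here is a minimal sketch of the five steps (init, predict, loss, gradient, step) on a toy linear model, written with NumPy rather than fastai. The data, the `train` helper, and all hyperparameters are assumptions chosen for illustration. The only difference between the two runs is `batch_size`: the full dataset gives regular gradient descent, a small value gives mini-batch SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 3x + 2 plus a little noise
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 2 + rng.normal(0, 0.1, size=200)

def train(batch_size, epochs=200, learning_rate=0.1):
    w, b = rng.normal(), rng.normal()             # 1. init: random weights
    for _ in range(epochs):
        order = rng.permutation(len(x))           # shuffle so batches are random
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            pred = w * x[idx] + b                 # 2. predict
            error = pred - y[idx]                 # 3. loss: MSE is mean(error ** 2)
            grad_w = np.mean(2 * error * x[idx])  # 4. gradient of the MSE
            grad_b = np.mean(2 * error)
            w -= learning_rate * grad_w           # 5. step
            b -= learning_rate * grad_b
    return w, b

w_gd, b_gd = train(batch_size=len(x))  # regular GD: one update per epoch
w_sgd, b_sgd = train(batch_size=20)    # mini-batch SGD: ten updates per epoch
print(f"GD:  w={w_gd:.2f}, b={b_gd:.2f}")
print(f"SGD: w={w_sgd:.2f}, b={b_sgd:.2f}")
```

Both runs should end up close to the true values `w = 3` and `b = 2`; the SGD run gets there with many cheaper, noisier updates.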

## Loss Function

### What is a Loss Function?
A loss function tells us how wrong our model's predictions are. 

Think of it as a "wrongness score":

- Low score = Good predictions
- High score = Bad predictions

### How It Works in Training
1. Model makes a prediction
2. Loss function calculates how wrong it was
3. Optimizer adjusts the weights to reduce this wrongness
4. Repeat until the loss stops improving
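
As a small illustration of the "wrongness score", the sketch below uses mean squared error, one common loss function and not necessarily the one used in the chapter. The function name `mse_loss` and the example numbers are made up.

```python
def mse_loss(predictions, targets):
    """Mean squared error: average of the squared differences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

targets = [1.0, 0.0, 1.0]
good_predictions = [0.9, 0.1, 0.8]  # close to the targets
bad_predictions = [0.2, 0.9, 0.3]   # far from the targets

print("good:", mse_loss(good_predictions, targets))
print("bad: ", mse_loss(bad_predictions, targets))
```

The good predictions produce a much lower loss than the bad ones, and perfect predictions produce a loss of exactly 0.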

## Optimizers

## Nonlinearity

## Machine Learning Jargon