# Lecture Notes: Summary and Guidelines for Training Basic Deep Networks

## **Step 1: Determine the Type of Task**

<br>

<img src="./images/311.png" width="500" style="display: block; margin: auto;">

<br>

Ask yourself: **What kind of output is your model predicting?**

- Continuous values → You’re doing **regression**
- Discrete labels → You’re doing **classification**

## **For Regression Tasks**

If your model outputs continuous numbers:

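- Use **L1 loss** (Mean Absolute Error), `L1Loss` in PyTorch, or  
- Use **L2 loss** (Mean Squared Error), `MSELoss` in PyTorch

In practice, L1 and L2 often work similarly. Start with `MSELoss` unless you have a specific reason to prefer L1, such as outliers in the targets: L1 penalizes large errors linearly rather than quadratically, so it is less sensitive to them.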
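To make the difference between the two losses concrete, here is a minimal sketch that evaluates both on the same predictions. The tensors are made-up values chosen for illustration, with one deliberate outlier in the targets.

```python
import torch
import torch.nn as nn

# Hypothetical predictions and targets; the last target is an outlier.
pred = torch.tensor([1.0, 2.0, 3.0, 4.0])
target = torch.tensor([1.5, 2.0, 2.0, 14.0])

l1 = nn.L1Loss()(pred, target)   # mean of |pred - target|
l2 = nn.MSELoss()(pred, target)  # mean of (pred - target)^2

print(f"L1 (MAE): {l1.item():.4f}")  # 2.8750
print(f"L2 (MSE): {l2.item():.4f}")  # 25.3125
```

The single outlier contributes 10 of the 11.5 total absolute error, but 100 of the 101.25 total squared error, so it dominates the L2 loss far more strongly.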

## **For Classification Tasks**

If your model predicts classes, ask:

**How many classes are there?**

### Case 1: Binary Classification (2 classes)

- Use: `BCEWithLogitsLoss` (Binary Cross Entropy with logits)
- Output: One scalar logit per sample (no sigmoid in the model; the loss applies it internally)
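As a rough sketch of the binary case, the snippet below feeds raw logits to `BCEWithLogitsLoss` and checks that the result matches a sigmoid followed by `BCELoss`. The logits and targets are invented example values, not data from the course.

```python
import torch
import torch.nn as nn

# Hypothetical batch of 4 raw logits, one per sample (no sigmoid applied).
logits = torch.tensor([2.0, -1.0, 0.5, 3.0])
# Targets are floats (0.0 or 1.0) with the same shape as the logits.
targets = torch.tensor([1.0, 0.0, 1.0, 1.0])

loss = nn.BCEWithLogitsLoss()(logits, targets)

# Same quantity computed in two steps; this form is less numerically stable.
reference = nn.BCELoss()(torch.sigmoid(logits), targets)

# At prediction time, convert logits to probabilities and threshold them.
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long()

print(loss.item(), reference.item())
print(preds.tolist())  # [1, 0, 1, 1]
```

The sigmoid is applied explicitly only when predictions are needed; during training the model can output raw logits.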

### Case 2: Multi-Class Classification (more than 2 classes)

- Use: `CrossEntropyLoss`
- Output: One logit per class, shape `[batch_size, num_classes]`

Tip: You can use `CrossEntropyLoss` with 2 classes if you output 2 values instead of 1.
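The following sketch illustrates the shapes `CrossEntropyLoss` expects and then checks the two-class tip numerically. All sizes and values are hypothetical; the second half relies on the fact that a softmax over the logits `[0, z]` gives the same class-1 probability as `sigmoid(z)`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, num_classes = 4, 3  # hypothetical sizes

logits = torch.randn(batch_size, num_classes)  # raw scores, no softmax
targets = torch.tensor([0, 2, 1, 2])           # class indices (int64)

loss = nn.CrossEntropyLoss()(logits, targets)
preds = logits.argmax(dim=1)                   # predicted class per sample

# Two-class case: CrossEntropyLoss on the logit pair [0, z] matches
# BCEWithLogitsLoss on the single logit z.
z = torch.tensor([1.2, -0.7, 0.3])
y = torch.tensor([1, 0, 1])
two_logits = torch.stack([torch.zeros_like(z), z], dim=1)  # shape [3, 2]

ce = nn.CrossEntropyLoss()(two_logits, y)
bce = nn.BCEWithLogitsLoss()(z, y.float())
print(ce.item(), bce.item())
```

Note the difference in target format: `CrossEntropyLoss` takes integer class indices here, while `BCEWithLogitsLoss` takes float targets.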

## **Multi-Label Classification (Tagging)**

If you’re predicting **multiple binary labels per sample**:

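- Use: `BCEWithLogitsLoss` with one logit per label, shape `[batch_size, num_labels]`
- Avoid: `CrossEntropyLoss`, because its softmax assumes the classes are mutually exclusive (exactly one correct class per sample)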
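A minimal multi-label sketch, assuming 2 samples and 4 independent tags (both numbers are arbitrary): each tag gets its own logit and its own 0/1 target, and each is thresholded independently.

```python
import torch
import torch.nn as nn

# Hypothetical logits: 2 samples, 4 independent tags.
logits = torch.tensor([[ 2.0, -1.0,  0.5, -3.0],
                       [-0.5,  1.5, -2.0,  4.0]])
# Multi-hot float targets: several tags can be active for one sample.
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 1.0, 0.0, 1.0]])

loss = nn.BCEWithLogitsLoss()(logits, targets)

# Each tag has its own sigmoid, so a row's probabilities need not sum to 1.
probs = torch.sigmoid(logits)
predicted_tags = probs > 0.5

print(loss.item())
print(probs.sum(dim=1))  # per-row sums are not constrained to 1
print(predicted_tags)
```

A softmax over the same logits would force each row to sum to 1, which is why it cannot represent two tags being confidently present at once.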

---

## **Step 2: Choose an Activation Function**

<br>

<img src="./images/3111.png" width="500" style="display: block; margin: auto;">

<br>

- **Default**: `ReLU` — it’s fast, simple, and works well in most cases
- Use `LeakyReLU` or `PReLU` if you encounter “dead” ReLU units (units that output zero for every input and therefore stop receiving gradient)

If you are using a known architecture rather than a custom model, keep the activation it already uses unless you have a reason to change it.
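To see why `LeakyReLU` can help with dead units, this small sketch compares the gradients the two activations pass back for negative inputs. The input values and the `negative_slope=0.01` setting are illustrative choices.

```python
import torch
import torch.nn as nn

# Hypothetical pre-activation values: two negative, two positive.
x = torch.tensor([-2.0, -0.5, 0.5, 1.0], requires_grad=True)

# ReLU: negative inputs produce zero output and zero gradient.
relu_out = nn.ReLU()(x)
relu_out.sum().backward()
relu_grad = x.grad.clone()

# LeakyReLU: negative inputs keep a small slope, so gradient still flows.
x.grad.zero_()
leaky_out = nn.LeakyReLU(negative_slope=0.01)(x)
leaky_out.sum().backward()
leaky_grad = x.grad.clone()

print(relu_grad.tolist())   # [0.0, 0.0, 1.0, 1.0]
print(leaky_grad.tolist())  # approximately [0.01, 0.01, 1.0, 1.0]
```

A unit whose input is negative for every sample gets no gradient through `ReLU`, while `LeakyReLU` still lets a small gradient through.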

---

## **Step 3: Pick an Optimizer**

<br>

<img src="./images/3112.png" width="500" style="display: block; margin: auto;">

<br>

Use **SGD** (Stochastic Gradient Descent) with:

- The largest batch size that fits in your GPU memory
- `momentum=0.9`

Example:

`torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)`
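For context, here is a sketch of how that optimizer line fits into a training loop. The toy model, the random data, and the step count are assumptions made for illustration; a real setup would iterate over mini-batches from a data loader instead of reusing one fixed batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy regression model and random data, for illustration only.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
inputs = torch.randn(64, 10)
targets = torch.randn(64, 1)

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

losses = []
for step in range(50):
    optimizer.zero_grad()                     # clear gradients from the last step
    loss = criterion(model(inputs), targets)  # forward pass
    loss.backward()                           # compute gradients
    optimizer.step()                          # SGD update with momentum
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```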

---

## **Step 4: Tune the Learning Rate**

Learning rate (`lr`) is **the most important hyperparameter** to tune.

### Strategy:

1. Start small (e.g., `0.001`)
2. Increase until the loss starts spiking → too high
3. If the model diverges (e.g., `loss = NaN`) → way too high
4. Back off slightly and use the largest stable value

Small learning rates = slow learning  
Large learning rates = unstable training; the loss can “explode” to `inf` or `NaN`
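The strategy above can be sketched as a simple sweep over candidate learning rates. Everything here is hypothetical: the toy model and data, the candidate list, and in particular the `is_stable` rule (final loss is finite and below the starting loss), which is only a crude stand-in for watching the loss curve for spikes.

```python
import math
import torch
import torch.nn as nn

def loss_curve(lr, steps=100):
    """Train a toy model with the given lr and return the per-step losses."""
    torch.manual_seed(0)  # same initialization and data for every candidate
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    inputs = torch.randn(64, 10)
    targets = torch.randn(64, 1)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        losses.append(loss.item())
        if not math.isfinite(losses[-1]):
            break  # diverged (inf or NaN): stop early
        loss.backward()
        optimizer.step()
    return losses

def is_stable(losses):
    """Crude check (an assumption): loss is finite and lower than at the start."""
    return math.isfinite(losses[-1]) and losses[-1] < losses[0]

# Start small and increase the learning rate by a factor of 10 each time.
candidates = [0.001, 0.01, 0.1, 1.0, 10.0]
curves = {lr: loss_curve(lr) for lr in candidates}
stable = [lr for lr in candidates if is_stable(curves[lr])]
best_lr = max(stable) if stable else None

print("stable learning rates:", stable)
print("largest stable value:", best_lr)
```

In practice it is worth plotting the curves as well, since a run can end below its starting loss and still show spikes along the way.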

---

## Final Notes

You now know how to:

- Choose a loss function based on your task type
- Use `ReLU` as a default activation
- Set up SGD with momentum
- Tune the learning rate

These form the foundation for training any deep neural network.

---

## What’s Next

In the rest of the course, you’ll learn:

- Architectural innovations
- Tricks for deeper networks
- Data-specific design ideas
- A “bag of tricks” to help you train better models
