# Non-Linearity and Deep Networks

## Motivation: Why Linear Models Are Limited
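- Linear models cannot capture non-linear patterns, e.g., detecting a dog paw on a gray background when the paw may be either black or white.
- A linear score treats black and white as opposites: if it rises for brighter pixels, it falls for darker ones. It therefore cannot respond the same way to both (black vs. white paws) while rejecting the gray in between.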
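
To make this concrete, here is a small worked argument for a single pixel. It assumes a one-pixel linear score \( f(x) = wx + b \) and the brightness values black = 0, gray = 0.5, white = 1; the symbols \( f \), \( w \), and \( b \) are introduced here only for illustration. Because gray lies midway between black and white, its score is forced to be the average of the other two:

$$
f(0.5) = 0.5\,w + b = \tfrac{1}{2}\big(f(0) + f(1)\big)
$$

So if both black and white score positive, gray scores positive too. No choice of \( w \) and \( b \) separates gray from both.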

## The Solution: Add Non-Linearity
- Non-linear models overcome the expressive limitations of linear models.
- **Rectified Linear Unit (ReLU)** is one of the simplest and most widely used non-linearities:
  
  $$
  \text{ReLU}(x) = \max(0, x)
  $$

<img src="./images/relu.png" width="500" style="display: block; margin: auto;">

<br>

- ReLU is piecewise linear:
  - Identity for \( x > 0 \)
  - Constant zero for \( x < 0 \)
  - Gradient:
    - 0 for \( x < 0 \)
    - 1 for \( x > 0 \)
    - Undefined at \( x = 0 \); in practice, implementations simply pick a fixed value there (commonly 0)
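
The definition and gradient above can be sketched in plain Python. The function names `relu` and `relu_grad` are illustrative, and returning 0 for the gradient at \( x = 0 \) is an assumed convention, not the only possible one.

```python
def relu(x: float) -> float:
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x))
```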

## Constructing a Simple Non-Linear Network
- Add a ReLU between two linear layers → the network can respond to both black and white pixels while rejecting gray.

<img src="./images/relu-paw.png" width="500" style="display: block; margin: auto;">

<br>

- Intuition:
  - Use one neuron to detect "brighter than gray" (\( x - 0.5 \))
  - Use another to detect "darker than gray" (\( 0.5 - x \))
  - Apply ReLU to both outputs
  - Sum the two ReLU outputs, which gives the absolute deviation from gray \( |x - 0.5| \), and subtract a constant

### Mathematical Example
- Assign pixel values:
  - Black = 0
  - Gray = 0.5
  - White = 1
- First layer outputs:
  - \( x - 0.5 \)
  - \( 0.5 - x \)
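- After ReLU, sum the two outputs:
  - \( \text{ReLU}(x - 0.5) + \text{ReLU}(0.5 - x) = |x - 0.5| \)
- Subtract a constant (e.g., 0.25) so that gray maps to a negative output (\( -0.25 \)) while black and white map to a positive one (\( +0.25 \))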
- This network can now distinguish black/white paws from gray background
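
The example above can be checked numerically with a minimal sketch that works on a single pixel value. The names `paw_score` and `threshold` are hypothetical, chosen for this illustration.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def paw_score(x: float, threshold: float = 0.25) -> float:
    """Linear -> ReLU -> linear, applied to one pixel value x in [0, 1]."""
    brighter = relu(x - 0.5)  # fires for pixels brighter than gray
    darker = relu(0.5 - x)    # fires for pixels darker than gray
    # Second linear layer: the sum of both units equals |x - 0.5|;
    # subtracting the constant pushes gray below zero.
    return brighter + darker - threshold


for name, x in (("black", 0.0), ("gray", 0.5), ("white", 1.0)):
    print(name, paw_score(x))
```

Black and white both score 0.25 and gray scores -0.25, so the sign of the output separates paw pixels from the background.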

## Deep Network Structure
- Deep networks are built by alternating **linear transformations** and **non-linearities**.
- Each linear layer maps an \( n \)-dimensional input to a \( c \)-dimensional output. For example, a two-layer network computes:
  
  $$
  f(x) = \text{ReLU}\big(W_2\,\text{ReLU}(W_1 x + b_1) + b_2\big)
  $$

- Parameters (weights and biases) are learned for **linear transformations**.
- Non-linearities usually don’t have learnable parameters (though some variants do, e.g., PReLU learns its slope for negative inputs).
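
As a sketch of the formula above, the following plain-Python forward pass evaluates a two-layer network on lists of floats. The helper names and the specific weights are made up for illustration; in a real network the weights would be learned.

```python
def relu(v):
    """Apply ReLU element-wise to a list of floats."""
    return [max(0.0, a) for a in v]


def linear(W, b, x):
    """Compute W x + b, where W is given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def two_layer_net(x, W1, b1, W2, b2):
    """f(x) = ReLU(W2 ReLU(W1 x + b1) + b2)."""
    hidden = relu(linear(W1, b1, x))
    return relu(linear(W2, b2, hidden))


# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, 2.0, -1.0]]
b2 = [0.5]

print(two_layer_net([2.0, 1.0], W1, b1, W2, b2))
```

Only `W1`, `b1`, `W2`, and `b2` are parameters; `relu` itself has nothing to learn.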

<img src="./images/relu-lin.png" width="500" style="display: block; margin: auto;">

<br>

## What Is a Layer?
- A **layer** is a reusable computational block.
- Examples:
  - Linear Layer: \( Wx + b \)
  - Non-Linearity: ReLU, Sigmoid, etc.
- Deep networks are composed of these elementary blocks.
- Common stacking pattern:
  - Linear → ReLU → Linear → ReLU → ...
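
One way to picture layers as reusable blocks is the sketch below. `Linear`, `ReLU`, and `Sequential` are small hypothetical classes written from scratch for this illustration, not any particular library's API.

```python
class Linear:
    """Linear layer computing W x + b."""

    def __init__(self, W, b):
        self.W, self.b = W, b

    def __call__(self, x):
        return [sum(w * a for w, a in zip(row, x)) + b_i
                for row, b_i in zip(self.W, self.b)]


class ReLU:
    """Element-wise non-linearity with no parameters."""

    def __call__(self, x):
        return [max(0.0, a) for a in x]


class Sequential:
    """Applies a list of layers one after another."""

    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


# Linear -> ReLU -> Linear -> ReLU, with hypothetical weights.
net = Sequential(
    Linear([[1.0], [-1.0]], [-0.5, 0.5]),
    ReLU(),
    Linear([[1.0, 1.0]], [-0.25]),
    ReLU(),
)
print(net([1.0]), net([0.5]))
```

Because every block exposes the same call interface, the stacking pattern can be extended by adding more entries to `Sequential`.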

<img src="./images/layer.png" width="500" style="display: block; margin: auto;">

<br>

## Counting Layers
- "10-layer network" = 10 linear layers (non-linearities are not counted).
- In complex architectures:
  - Count the **maximum number of sequential linear transformations** an input goes through.
  - Parallel layers do **not** increase the layer count.
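
The counting rule can be sketched as a longest-path computation over the network's graph. The `graph` dictionary below describes a made-up architecture (two parallel branches that merge before an output layer), and `depth` is a hypothetical helper.

```python
from functools import lru_cache

# Hypothetical architecture: node -> (kind, list of input nodes).
# Branch a has 1 linear layer, branch b has 2; they merge, then 1 more.
graph = {
    "input": ("input", []),
    "lin_a": ("linear", ["input"]),
    "relu_a": ("relu", ["lin_a"]),
    "lin_b1": ("linear", ["input"]),
    "relu_b": ("relu", ["lin_b1"]),
    "lin_b2": ("linear", ["relu_b"]),
    "add": ("add", ["relu_a", "lin_b2"]),
    "lin_out": ("linear", ["add"]),
}


@lru_cache(maxsize=None)
def depth(node: str) -> int:
    """Maximum number of linear layers on any path from the input to `node`."""
    kind, inputs = graph[node]
    longest_input = max((depth(p) for p in inputs), default=0)
    return longest_input + (1 if kind == "linear" else 0)


total_linear = sum(1 for kind, _ in graph.values() if kind == "linear")
print("linear layers in graph:", total_linear)
print("layer count (depth):", depth("lin_out"))
```

The graph contains four linear layers, but the longest sequential chain passes through only three of them, so this would be called a 3-layer network.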

<img src="./images/layer-count.png" width="500" style="display: block; margin: auto;">

<br>

## Universal Approximation Theorem
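- A network with just one hidden layer (with a non-linearity) and a linear output layer can approximate **any continuous function** to arbitrary accuracy (under certain conditions, e.g., a bounded input domain and a sufficiently wide hidden layer).
- Implications:
  - Even shallow networks, and therefore deep ones, are **arbitrarily expressive** in principle
  - But expressivity alone does not make a network practical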
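
A one-dimensional illustration of this idea: the sketch below hand-constructs a network with one hidden ReLU layer and a linear output layer that interpolates a target function at evenly spaced knots. It is an illustration under stated assumptions, not the theorem's proof and not how networks are trained in practice. The names `build_one_hidden_layer_net` and `max_error` are hypothetical, and `math.sin` on \( [0, 2\pi] \) is an arbitrary choice of target.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def build_one_hidden_layer_net(g, a, b, n_hidden):
    """Return f(x) = bias + sum_i c_i * ReLU(x - k_i), matching g at the knots."""
    knots = [a + (b - a) * i / n_hidden for i in range(n_hidden + 1)]
    slopes = [(g(knots[i + 1]) - g(knots[i])) / (knots[i + 1] - knots[i])
              for i in range(n_hidden)]
    # Output weight of hidden unit i is the change in slope at knot i.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1]
                            for i in range(1, n_hidden)]
    bias = g(a)

    def f(x):
        # Hidden layer: weight 1, bias -k_i, followed by ReLU.
        hidden = [relu(x - k) for k in knots[:-1]]
        # Linear output layer.
        return bias + sum(c * h for c, h in zip(coeffs, hidden))

    return f


def max_error(g, f, a, b, samples=1000):
    """Largest absolute gap between g and f on an evenly spaced grid."""
    grid = (a + (b - a) * i / samples for i in range(samples + 1))
    return max(abs(g(x) - f(x)) for x in grid)


for n_hidden in (4, 16, 64):
    f = build_one_hidden_layer_net(math.sin, 0.0, 2 * math.pi, n_hidden)
    print(n_hidden, max_error(math.sin, f, 0.0, 2 * math.pi))
```

The printed error shrinks as the hidden layer gets wider, which is the sense in which a single hidden layer can be made as accurate as desired.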

<img src="./images/uat.png" width="500" style="display: block; margin: auto;">

<br>

### Caveats
1. **Training may be difficult**: the theorem says suitable parameters exist, not how to find them
2. **Construction may be inefficient**: the hidden layer the theorem calls for may be impractically wide

## Practical Deep Learning Focus
- Instead of proving expressivity, we build **efficient architectures** that can:
  - Express the functions we care about
  - Be trained effectively on real data
- Non-linearities are what make deep networks **powerful** and **expressive**

<img src="./images/dntldr.png" width="500" style="display: block; margin: auto;">

<br>