# Week 1: Neural Networks

**Learning Objectives**

* Get familiar with the diagram and components of a neural network.
* Understand the concept of a "layer" in a neural network.
* Understand how neural networks learn new features.
* Understand how activations are calculated at each layer.
* Learn how a neural network can perform classification on an image.
* Use a framework, TensorFlow, to build a neural network for classification of an image.
* Learn how data goes into and out of a neural network layer in TensorFlow.
* Build a neural network in regular Python code (from scratch) to make predictions.
* (Optional): Learn how neural networks use parallel processing (vectorization) to make computations faster.

## Inference in Code (TensorFlow)

### Key Concepts and Framework

* **TensorFlow:** One of the leading frameworks for implementing deep learning algorithms, used primarily for its efficiency and extensive ecosystem.
* **Inference:** The process of feeding an input feature vector (**$x$**) through a trained neural network to generate a prediction (**$\hat{y}$**).
* **Application Example (Coffee Roasting):** A simplified binary classification problem where the network predicts whether a batch of coffee will taste "good" (positive class, $y=1$) or "bad" (negative class, $y=0$) based on two input features: **Temperature** and **Duration**.

### Implementing Inference in TensorFlow

The process involves defining the layers and sequentially passing the activations from one layer to the next.

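1.  **Input Feature Vector ($x$):** The input is a NumPy array containing the feature values (e.g., $x = [200, 17]$ for temperature and duration). A `Dense` layer expects a two-dimensional array with one row per example, so in code this is written as `x = np.array([[200.0, 17.0]])`.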

2.  **Defining Layer 1 (Hidden Layer):**
    * **Syntax:** `Layer_1 = tf.keras.layers.Dense(units=3, activation='sigmoid')`
    * **Interpretation:** Creates a **dense layer** (the standard layer type used so far) with 3 hidden units and the sigmoid activation function.

3.  **Forward Propagation for Layer 1:**
    * **Syntax:** `a1 = Layer_1(x)`
    * **Result:** The activation vector **$a_1$** is computed and contains three numbers, one for each unit.

4.  **Defining Layer 2 (Output Layer):**
    * **Syntax:** `Layer_2 = tf.keras.layers.Dense(units=1, activation='sigmoid')`
    * **Interpretation:** Creates the output layer with a single unit and the sigmoid activation function (suitable for binary classification).

5.  **Forward Propagation for Layer 2:**
    * **Syntax:** `a2 = Layer_2(a1)`
    * **Result:** The final activation value **$a_2$** (a single number, e.g., 0.8), which represents the model's confidence in the positive class.

6.  **Prediction ($\hat{y}$):** The final activation $a_2$ is optionally **thresholded** (usually at 0.5) to produce the binary prediction $\hat{y}$ (1 or 0).

The same layer construction (`tf.keras.layers.Dense`) can be extended to deep networks, such as those used for handwritten digit classification, by simply chaining more layers.
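
To make the six steps above concrete, here is a minimal from-scratch sketch of the same two-layer forward pass in plain Python. The helper names (`sigmoid`, `dense`) and all weight and bias values are hypothetical, chosen only for illustration; they are not trained values and not TensorFlow's internals. Note that this sketch stores one weight list per unit, whereas Keras stores the kernel with shape (inputs, units).

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(a_in, W, b):
    """One dense layer: unit j computes sigmoid(w_j . a_in + b_j).

    W holds one weight list per unit; b holds one bias per unit.
    """
    return [
        sigmoid(sum(w_i * a_i for w_i, a_i in zip(w_j, a_in)) + b_j)
        for w_j, b_j in zip(W, b)
    ]

# Hypothetical parameters for illustration only (not trained values).
W1 = [[0.01, -0.10], [-0.02, 0.05], [0.005, 0.02]]  # 3 units x 2 features
b1 = [0.0, 1.0, -1.0]
W2 = [[1.5, -2.0, 0.5]]                             # 1 unit x 3 inputs
b2 = [0.2]

x = [200.0, 17.0]                  # temperature, duration
a1 = dense(x, W1, b1)              # three activations, one per hidden unit
a2 = dense(a1, W2, b2)             # a single activation in (0, 1)
y_hat = 1 if a2[0] >= 0.5 else 0   # threshold at 0.5 to get the binary prediction
```

With trained weights in place of the made-up ones, `a2[0]` would play the role of the model's confidence in the positive class.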

## TensorFlow vs Keras

Keras and TensorFlow are **not the same**, but they are intimately related. Keras is an **API** (Application Programming Interface) that serves as the high-level, user-friendly interface for building and training neural networks, while TensorFlow is the **complete framework** that provides the low-level, high-performance computational engine.

Think of TensorFlow as the **engine** of a powerful race car, and Keras as the user-friendly **dashboard and steering wheel** that makes the car easy to drive.

### Key Differences and Relationship

| Feature | TensorFlow | Keras |
| :--- | :--- | :--- |
| **Role/Level** | **Framework** (Low-Level) | **API** (High-Level Abstraction) |
| **Primary Goal** | Provides the numerical computation tools, scalability, and flexibility (the "backend"). | Simplifies the process of creating, configuring, and training deep learning models (the "frontend"). |
| **Usability** | More complex, requires more code, offers **fine-grained control** for advanced users and research. | Simple, intuitive, requires minimal code, ideal for **beginners** and **rapid prototyping**. |
| **Integration** | The entire system. | Originally a standalone library, it is now the **official high-level API** within TensorFlow (`tf.keras`). |

### The Modern Relationship

Since the release of TensorFlow 2.0, the relationship has been streamlined:

* **Integration:** Keras is now fully integrated into TensorFlow as `tf.keras`. When you use Keras layers or models, you are indirectly using TensorFlow's computational engine to perform the underlying math operations.
* **Default Use:** For most practitioners, TensorFlow recommends using the Keras APIs by default because they simplify the workflow for model building, training, and deployment.
* **Customization:** You use Keras for simplicity and speed, but you can always drop down to the lower-level TensorFlow APIs if you need complete, fine-grained control over every tensor operation.

In short, when you code a neural network today, you are typically using **TensorFlow** (the framework) through **Keras** (the easy-to-use API).

## Building a Neural Network using TensorFlow

### 1. Building a Neural Network with `Sequential`

  * **Sequential Model:** The `tf.keras.Sequential` class provides a simpler, more compact way to define a neural network by chaining layers together in a linear stack. This is the **standard convention** for coding in TensorFlow.
  * **Contrast with Explicit Forward Prop:** Instead of manually creating `Layer_1`, computing `a1 = Layer_1(x)`, creating `Layer_2`, and computing `a2 = Layer_2(a1)`, you define the entire structure upfront.
  * **Compact Code Convention:** Layers are typically defined and passed directly to the `Sequential` constructor, avoiding explicit assignment to individual layer variables (`Layer_1`, `Layer_2`).
    ```python
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units=3, activation='sigmoid'),  # Layer 1
        tf.keras.layers.Dense(units=1, activation='sigmoid')   # Layer 2
    ])
    ```
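
Conceptually, a linear stack just feeds the output of each layer into the next. The sketch below illustrates that idea in plain Python. `TinySequential` and the two toy layers are hypothetical names invented for this note; this is not how Keras implements `Sequential`.

```python
class TinySequential:
    """Hypothetical stand-in for a linear stack of layers (not the Keras class)."""

    def __init__(self, layers):
        self.layers = list(layers)

    def predict(self, x):
        # Feed the output of each layer into the next one, in order.
        for layer in self.layers:
            x = layer(x)
        return x

# Two toy "layers" written as plain functions, for illustration only.
def double(values):
    return [2 * v for v in values]

def total(values):
    return [sum(values)]

model = TinySequential([double, total])
result = model.predict([1, 2, 3])  # double gives [2, 4, 6], total gives [12]
```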

### 2. Training and Inference (The Three Key Functions)

Once the model is defined, TensorFlow handles the heavy lifting of training and making predictions.

| Function | Purpose | Details |
| :--- | :--- | :--- |
| **`model.compile(...)`** | **Configuration** | Specifies the optimizer, loss function, and metrics for training (details covered in the next week). |
| **`model.fit(X, Y)`** | **Training** | Takes the input data (`X`) and target labels (`Y`) and trains the neural network. TensorFlow handles the backpropagation and gradient descent steps. |
| **`model.predict(X_new)`** | **Inference/Forward Prop** | Performs **forward propagation** on new input data (`X_new`) and outputs the network's predictions (e.g., the value of $a_L$), replacing the need for manual layer computations. |

### 3. Focus on Understanding

While libraries like TensorFlow and PyTorch allow building complex, state-of-the-art networks with just a few lines of code (e.g., `model.compile`, `model.fit`, `model.predict`), it's crucial to understand what is happening **under the hood**.

## TensorFlow code snippet

### Data normalization

Normalizing training data will help gradient descent converge faster.

```python
# Import the necessary classes.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Normalization

# Normalize the data.
norm_l = Normalization(axis=-1)
norm_l.adapt(X)  # learns the mean and variance of each feature
X_n = norm_l(X)
```
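
As a rough illustration of what "learns mean and variance" means, the sketch below computes a per-feature mean and variance and then rescales each feature to zero mean and unit variance. The function names, the sample rows, and the `eps` guard against division by zero are assumptions made for this example; the Keras layer may differ in details.

```python
import math

def fit_normalizer(X):
    """Return the per-feature mean and (population) variance of X, a list of rows."""
    m = len(X)
    n = len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    variances = [sum((row[j] - means[j]) ** 2 for row in X) / m for j in range(n)]
    return means, variances

def normalize(X, means, variances, eps=1e-7):
    """Shift each feature to mean 0 and scale it to unit variance."""
    return [
        [(row[j] - means[j]) / max(math.sqrt(variances[j]), eps)
         for j in range(len(row))]
        for row in X
    ]

# Made-up (temperature, duration) rows, for illustration only.
X_demo = [[200.0, 17.0], [220.0, 13.0], [180.0, 15.0]]
means, variances = fit_normalizer(X_demo)
X_demo_n = normalize(X_demo, means, variances)
```

The statistics are computed once on the training data and then reused for any new data, which is why the test set later in these notes is passed through the same `norm_l` layer.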

### Building a model

The goal is to build a 2-layer neural network with 3 units in the first layer and 1 unit in the second layer:

```python

tf.random.set_seed(1234)  # applied to achieve consistent results.

# Define the model.
model = Sequential(
    [
        tf.keras.Input(shape=(2,)),  # specifies the expected input shape (number of features)
        Dense(3, activation='sigmoid', name='layer1'),
        Dense(1, activation='sigmoid', name='layer2')
    ]
)

# Output description of the model.
model.summary()

# Extract initialized weights
W1, b1 = model.get_layer('layer1').get_weights()
W2, b2 = model.get_layer('layer2').get_weights()

# Specify the loss function and the optimizer.
model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
)

# Train the model.
model.fit(Xt, Yt, epochs=10)

# Predict new examples.
X_testn = norm_l(X_test) # Normalize test data
model.predict(X_testn)
```

## A brief note on the Adam optimization algorithm

The Adam optimizer (short for Adaptive Moment Estimation) is currently one of the most popular and effective algorithms for training deep neural networks. It is the go-to default optimizer for many TensorFlow (and PyTorch) projects because it typically converges faster and often performs better than traditional methods like Stochastic Gradient Descent (SGD).

Adam combines the best features of two simpler optimizers (Momentum and RMSProp) by tracking two key moments:

1.  **First Moment (Momentum):** Tracks the average of past gradients. This helps **accelerate** learning in the correct direction and **dampens oscillations** (wobbles) in the loss landscape.
2.  **Second Moment (Adaptive Scaling):** Tracks the average of the squared past gradients. This is used to **scale the learning rate**:
    * Parameters whose gradients have historically been large get a **smaller, more conservative effective learning rate**.
    * Parameters whose gradients have historically been small get a **larger, more aggressive effective learning rate**.
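
The two moments can be sketched for a single scalar parameter as follows. This is a minimal illustration of the commonly published Adam update rule, not TensorFlow's implementation. The function name `adam_step`, the hyperparameter defaults, and the toy objective $f(\theta) = \theta^2$ are assumptions made for this example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)  # scaled step
    return theta, m, v

# Toy objective f(theta) = theta**2, whose gradient is 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```

Dividing by the square root of the second moment is what makes the step size roughly independent of the raw gradient magnitude: on the very first step the update is close to `lr` regardless of how large the gradient is.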

### Benefits

* **Fast Convergence:** Often reaches a good solution in fewer steps than plain SGD.
* **Ease of Use:** It is robust and generally requires minimal tuning of its default hyperparameters.
* **TensorFlow Integration:** Easily implemented when compiling a Keras model: `optimizer=tf.keras.optimizers.Adam(...)`.

## Concept of an epoch

An **epoch** is one complete pass through the entire training dataset during the training of a machine learning model, typically a neural network.

Imagine you have a recipe book with 1,000 different recipes, and you need to learn to cook all of them perfectly.

* **Training Dataset:** The entire recipe book (all 1,000 recipes).
* **One Epoch:** You cook all 1,000 recipes exactly once.

During one epoch, the following sequence of events occurs for each training example (or, with mini-batch gradient descent, for each mini-batch of examples):

1.  **Forward Pass:** The data point is fed through the neural network to generate a prediction.
2.  **Loss Calculation:** The error (or "loss") between the prediction and the true label is calculated.
3.  **Backward Pass:** The error is backpropagated through the network.
4.  **Parameter Update:** The model's internal weights and biases are adjusted to reduce that error (using an optimizer like Adam).

Once all data points have been processed and the model's parameters have been updated based on the entire set, **one epoch is complete.**
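
The four steps can be sketched as a plain-Python training loop. To keep the example small it uses a single-feature logistic model and plain per-example gradient descent rather than Adam. The toy data, the learning rate, and the number of epochs are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy one-feature dataset, made up for illustration.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0
learning_rate = 0.5
epoch_losses = []

for epoch in range(20):                 # each loop iteration is one epoch
    total_loss = 0.0
    for x, y in zip(xs, ys):            # one pass over every training example
        p = sigmoid(w * x + b)          # 1. forward pass
        total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))  # 2. loss
        dz = p - y                      # 3. backward pass (gradient w.r.t. z)
        dw, db = dz * x, dz
        w -= learning_rate * dw         # 4. parameter update
        b -= learning_rate * db
    epoch_losses.append(total_loss / len(xs))  # mean loss seen during this epoch
```

Comparing the first and last entries of `epoch_losses` shows the loss falling as more epochs are completed.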

### Why We Use Multiple Epochs

Training a model usually takes many epochs (often tens or hundreds) because a single pass isn't enough for the model to fully learn the complex patterns in the data.

* **Initial Epochs:** The model's error is high, and the weights change dramatically with each epoch as the model begins to learn the main features.
* **Later Epochs:** The model starts to fine-tune its parameters, and the change in error (loss) becomes smaller with each subsequent pass.

### Epochs vs. Iterations

It's important to distinguish epochs from **iterations** (or **steps**), especially when using Mini-Batch Gradient Descent:

| Term | Definition |
| :--- | :--- |
| **Epoch** | One complete pass through the **entire dataset** ($m$ examples). |
| **Iteration (or Step)** | One pass through a **single mini-batch** of data ($m_{batch}$ examples) resulting in one parameter update. |

If your training set has $m = 10{,}000$ examples and your mini-batch size is $m_{batch} = 100$, then:

$$\text{Iterations per epoch} = \frac{\text{Total examples } (m)}{\text{Mini-batch size } (m_{batch})} = \frac{10{,}000}{100} = 100$$

So, **1 epoch** requires **100 iterations** to complete.
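
The same arithmetic as a small helper. The function name is hypothetical, and rounding up assumes that a final, smaller mini-batch still counts as one iteration when the dataset size is not a multiple of the batch size.

```python
import math

def iterations_per_epoch(m, batch_size):
    """Number of parameter updates needed to see all m examples once."""
    return math.ceil(m / batch_size)

steps = iterations_per_epoch(10_000, 100)         # 100, as in the formula above
steps_uneven = iterations_per_epoch(10_050, 100)  # 101: the last batch has 50 examples
```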

### Stopping Training

You generally don't train for an arbitrary number of epochs. You monitor metrics to decide when to stop:

* **Convergence:** When the loss on the training set stops decreasing significantly.
* **Overfitting:** A critical point is reached when the model begins to perform worse on new, unseen data (the validation set). Training beyond this point means the model is memorizing the training data, and you should stop (a technique known as **Early Stopping**).
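
A minimal sketch of the early stopping idea, assuming you record the validation loss after every epoch. The function name, the `patience` parameter (how many non-improving epochs to tolerate), and the loss values are assumptions made for this example, not part of the notes above.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss                      # new best validation loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1  # validation loss did not improve
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)                   # never triggered: ran all epochs

# Made-up validation losses: they improve for three epochs, then get worse.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_at = early_stopping_epoch(val_losses, patience=2)  # stops at epoch 5
```

In practice you would also keep the weights from the best epoch (epoch 3 in this made-up run) rather than the ones from the epoch where training stopped.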