## Improving Neural Network Performance: A Roadmap

There are two primary areas to focus on for enhancing ANN performance:
1.  **Fine-tuning Hyperparameters**.
2.  **Addressing Common Problems** that arise during training.

#### 1. Fine-Tuning Hyperparameters

**Hyperparameters** are values chosen by the Deep Learning engineer before training, rather than learned by the model from data. Setting them well can significantly improve a neural network's performance.

The key hyperparameters discussed are:

*   **1.1. Number of Hidden Layers**
    *   Neural networks consist of an input layer, an output layer, and hidden layers in between.
    *   While it's possible to use a single hidden layer with many neurons, it's generally a **much better approach to have multiple hidden layers with fewer neurons per layer** (e.g., 3 layers with 128 neurons each, versus 1 layer with 512 neurons).
    *   **Reasoning (Representation Learning)**: Deeper networks excel at "representation learning." Earlier hidden layers capture primitive features (like lines or edges), intermediate layers combine these into shapes, and later layers form complex patterns (like a face). This hierarchical feature extraction is very powerful.
    *   **Reasoning (Transfer Learning)**: More hidden layers also facilitate "transfer learning." A model trained on one task (e.g., human face detection) can reuse its early, primitive feature-extracting layers for a similar task (e.g., monkey face detection), requiring only retraining of later layers.
    *   **How many layers?**: There's no fixed number. The guidance is to **increase the number of hidden layers until overfitting begins**.
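
As a rough illustration of the depth-versus-width comparison above, the sketch below counts the trainable parameters of the two example architectures. The input size of 100 features and the single output neuron are assumptions made only for this example, and `dense_param_count` is a hypothetical helper, not a library function.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first and output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        # One weight per connection plus one bias per neuron in the next layer.
        total += fan_in * fan_out + fan_out
    return total


# Assumed for illustration: 100 input features, 1 output neuron.
deep = [100, 128, 128, 128, 1]  # 3 hidden layers with 128 neurons each
wide = [100, 512, 1]            # 1 hidden layer with 512 neurons

deep_params = dense_param_count(deep)
wide_params = dense_param_count(wide)
print("deep:", deep_params, "wide:", wide_params)
```

Under these assumptions the deeper network is also the smaller one. The parameter count alone does not show which network generalises better; that still has to be checked on validation data.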

*   **1.2. Number of Neurons per Layer**
    *   **Input Layer**: The number of neurons is determined by the **number of input features (columns)** in your dataset.
    *   **Output Layer**: The number of neurons depends on the **type of problem**: one for single-target regression or binary classification, and `n` for multi-class classification, where `n` is the number of classes.
    *   **Hidden Layers**: There is **no hard and fast rule**.
        *   Historically, a "pyramid structure" was suggested, where the number of neurons decreases in successive hidden layers (e.g., 64 -> 32 -> 16). The logic was that primitive features (more numerous) combine into fewer complex patterns.
        *   However, experiments have shown that a **rectangular structure** (e.g., 3 hidden layers all with 32 neurons) yields similar performance.
        *   **Key Principle**: Always ensure a **sufficient number of neurons**. If the initial layers don't capture enough primitive features, that information is lost and cannot be recovered later.
        *   **Recommendation**: Start with a **generous number of neurons** and reduce them if overfitting becomes an issue.
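
The sizing rules above can be sketched as two small helpers. The function names, the `problem_type` strings, and the halving rule used for the pyramid are assumptions chosen for illustration; the notes only give 64 -> 32 -> 16 as an example.

```python
def output_units(problem_type, n_classes=None):
    """Return the number of output neurons for a given problem type."""
    if problem_type in ("regression", "binary"):
        return 1
    if problem_type == "multiclass":
        if n_classes is None or n_classes < 2:
            raise ValueError("multiclass problems need n_classes >= 2")
        return n_classes
    raise ValueError(f"unknown problem type: {problem_type}")


def hidden_sizes(first, n_layers, shape="rectangular"):
    """Return the widths of the hidden layers.

    'rectangular' keeps every layer at the same width;
    'pyramid' halves the width at each successive layer (an assumed rule).
    """
    if shape == "rectangular":
        return [first] * n_layers
    if shape == "pyramid":
        return [max(1, first // (2 ** i)) for i in range(n_layers)]
    raise ValueError(f"unknown shape: {shape}")


print(hidden_sizes(64, 3, shape="pyramid"))
print(hidden_sizes(32, 3))
print(output_units("multiclass", n_classes=10))
```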

*   **1.3. Learning Rate and Optimizer**
    *   These hyperparameters significantly impact **training speed** and are discussed in detail under "Slow Training" (section 2.3). An **optimiser** such as Adam typically converges faster and more stably than vanilla Gradient Descent.

*   **1.4. Batch Size**
    *   Batch size determines how many data points are processed before a single weight update in Mini-Batch Gradient Descent.
    *   **Two main approaches**:
        *   **Smaller Batch Sizes** (e.g., 8, 32): Generally lead to **better generalisation** (better performance on unseen data) but result in **slower training**.
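        *   **Larger Batch Sizes** (e.g., 1024, 2048): Result in **faster training** per epoch because there are fewer weight updates and each one makes better use of parallel hardware. However, they often do not generalise as well, and training can be unstable when they are paired with a large learning rate. The maximum batch size is usually limited by GPU memory.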
    *   **Warming Up the Learning Rate**: A technique used with large batch sizes where the **learning rate starts very small in the early epochs and is then ramped up to its target value**. This can give both fast training and good results.
    *   **Recommendation**: Try "warming up the learning rate" with a larger batch size first. If it doesn't yield good results, revert to using smaller batch sizes.
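
A minimal sketch of the two ideas above: how batch size changes the number of weight updates per epoch, and what a warm-up schedule might look like. The linear ramp, the function names, and the numbers (50,000 samples, 5 warm-up epochs) are assumptions for illustration; other ramp shapes are also used in practice.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)


def warmup_lr(epoch, target_lr, warmup_epochs):
    """Linear warm-up: ramp towards target_lr, then hold it.

    Epochs are counted from 0. The first epoch uses target_lr / warmup_epochs
    and the last warm-up epoch reaches target_lr.
    """
    if epoch >= warmup_epochs:
        return target_lr
    return target_lr * (epoch + 1) / warmup_epochs


# Assumed dataset size of 50,000 samples.
print(updates_per_epoch(50_000, 32))    # many small updates
print(updates_per_epoch(50_000, 1024))  # few large updates

schedule = [warmup_lr(e, target_lr=0.1, warmup_epochs=5) for e in range(8)]
print(schedule)
```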

*   **1.5. Activation Function**
    *   The choice of activation function is crucial for mitigating the Vanishing and Exploding Gradient Problems, and is covered in more detail under those problems (section 2.1).

*   **1.6. Epochs**
    *   An epoch is one full pass over the training dataset; this hyperparameter sets how many such passes are used to train the model.
    *   **Recommendation**: Don't worry about setting a fixed number. Instead, use a concept called **Early Stopping**.
    *   **Early Stopping**: This is an intelligent mechanism that monitors the model's performance (e.g., loss or accuracy on a validation set). It automatically stops training when there is no significant improvement over a certain number of consecutive epochs, preventing overfitting and saving computation time. This is often implemented using "callbacks" in deep learning frameworks.
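
The early stopping logic described above can be sketched in plain Python. This is a simplified stand-in for a framework callback, not any library's actual implementation; the class name, the `patience` and `min_delta` parameters, and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # consecutive non-improving epochs tolerated
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Significant improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.59]

stopper = EarlyStopping(patience=3, min_delta=0.001)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "best loss", stopper.best_loss)
```

In a real training loop the model weights from the best epoch would usually be saved and restored as well; that step is left out here to keep the sketch short.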

#### 2. Common Problems in Neural Networks and Their Solutions

Even with fine-tuned hyperparameters, several problems can hinder a deep neural network's performance.

*   **2.1. Vanishing and Exploding Gradient Problems**
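    *   **Description**: These problems occur during Backpropagation, when gradients become either extremely small (vanishing) or extremely large (exploding), making weight updates ineffective or unstable. Vanishing gradients are common with Sigmoid/tanh activation functions in deep networks, because these functions saturate and their derivatives shrink towards zero, so the product of many such derivatives across layers becomes tiny.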
    *   **Solutions**:
        *   **Proper Weight Initialisation**: Instead of simple random initialisation, use methods like Glorot (Xavier) or He initialisation.
        *   **Change Activation Functions**: Replace Sigmoid/tanh with functions like **ReLU** (Rectified Linear Unit) or its variants (e.g., Leaky ReLU).
        *   **Batch Normalisation**: Normalises layer inputs, stabilising activations and preventing gradients from becoming too extreme.
        *   **Gradient Clipping**: Specifically for exploding gradients, it caps the maximum magnitude of gradients to prevent them from becoming too large.
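
Two of the solutions above reduce to short formulas, sketched here in plain Python. The standard deviations shown are for the normal-distribution variants of Glorot and He initialisation, and the clipping shown is clipping by overall norm (clipping each value individually is another common option). The function names are hypothetical.

```python
import math


def glorot_std(fan_in, fan_out):
    """Standard deviation for Glorot (Xavier) normal initialisation."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation for He normal initialisation (suited to ReLU)."""
    return math.sqrt(2.0 / fan_in)


def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


print(glorot_std(128, 128), he_std(128))

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5
print(clipped)
```

Clipping by norm keeps the direction of the gradient and only shortens it, which is why it is often preferred over clipping each component separately.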

*   **2.2. Inadequate Data**
    *   **Description**: Deep Learning models are "data-hungry". If you don't have enough data, your model might not learn effectively.
    *   **Solutions**:
        *   **Transfer Learning**: Use a pre-trained model (trained on a large dataset for a similar problem) and fine-tune it on your smaller dataset.
        *   **Pre-training**: Train a portion of your network using unsupervised or semi-supervised methods on available data before fine-tuning with supervised learning.

*   **2.3. Slow Training**
    *   **Description**: Training deep networks can be computationally intensive and time-consuming.
    *   **Solutions**:
        *   **Different Optimisers**: Use advanced optimisers beyond vanilla Gradient Descent, such as Adam, RMSprop, or Adagrad, which adapt the learning rate for each parameter and often converge faster.
        *   **Learning Rate Schedulers**: Dynamically adjust the learning rate during training (e.g., decrease it over time or use cyclical learning rates).
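
To make both solutions concrete, here is a minimal sketch of a step-decay schedule and a single Adam update for one scalar parameter. The helper names, the `state` dictionary, and the toy objective f(x) = x^2 are assumptions for illustration; the default values for `beta1`, `beta2`, and `eps` are the commonly used ones.

```python
import math


def step_decay(initial_lr, epoch, drop_every=10, factor=0.5):
    """Multiply the learning rate by `factor` every `drop_every` epochs."""
    return initial_lr * factor ** (epoch // drop_every)


def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return the new value."""
    state["t"] += 1
    # Running averages of the gradient and of its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction, which matters most in the first few steps.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


print([step_decay(0.1, e) for e in (0, 10, 25)])

# Toy objective f(x) = x^2, whose gradient is 2x.
x = 5.0
state = {"t": 0, "m": 0.0, "v": 0.0}
x_new = adam_step(x, grad=2 * x, state=state, lr=0.1)
print(x_new)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of how large the gradient is. This is one way to see what "adapting the learning rate" means in practice.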

*   **2.4. Overfitting**
    *   **Description**: A model that performs very well on training data but poorly on unseen data has learned the training data too specifically, including its noise. This is common in deep networks with many parameters.
    *   **Solutions**:
        *   **Regularisation (L1/L2)**: Adds a penalty to the loss function based on the magnitude of the weights (the sum of their absolute values for L1, the sum of their squares for L2), discouraging overly complex models.
        *   **Dropout**: Randomly sets a fraction of neuron outputs to zero during training, preventing complex co-adaptations between neurons and forcing the network to learn more robust features.
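
Both techniques can be sketched in a few lines of plain Python. The dropout shown is the "inverted" variant, which rescales the surviving outputs during training so nothing needs to change at inference time; that scaling is an assumption beyond what the notes describe. The function names and the `lam` strength parameter are hypothetical.

```python
import random


def l1_penalty(weights, lam):
    """L1 regularisation term: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 regularisation term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each output with probability `rate`.

    Surviving outputs are divided by (1 - rate) so the expected value
    of each output stays the same. At inference time nothing is dropped.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


weights = [1.0, -2.0]
print(l1_penalty(weights, lam=0.01), l2_penalty(weights, lam=0.01))

rng = random.Random(0)  # seeded so the run is repeatable
train_out = dropout([1.0] * 10, rate=0.5, rng=rng)
eval_out = dropout([1.0] * 10, rate=0.5, rng=rng, training=False)
print(train_out)
print(eval_out)
```

The penalty value is added to the training loss, so larger weights make the loss larger and are discouraged by the optimiser.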

These techniques, when understood and applied correctly, are essential for building high-performing neural networks and will be explored in detail in subsequent videos.