# Data Scaling (Feature Scaling) in Neural Networks

**1. Introduction**

*   **Context:** After discussing Early Stopping to prevent overfitting, another crucial step to improve neural network performance is **data scaling**, specifically **normalising inputs**.
*   **Purpose:** Data scaling aims to make neural network training faster and more stable by addressing issues arising from features with widely different scales.

**2. The Problem: Unscaled Input Features**

*   **Scenario:** Neural networks often work with datasets where input features (columns) have **very different scales or ranges**.
    *   **Example:** Consider a classification problem with two input features: **Age** and **Estimated Salary**.
        *   Age might range from 0 to roughly 60-70 years.
        *   Estimated Salary could range from thousands to millions (e.g., 0 to 1 crore, i.e., 10 million).
*   **Consequences of Unscaled Data:**
    *   When training a neural network on such data, the network can take a **very long time to learn** because its weight updates are poorly balanced across features.
    *   Training can be **unstable** or may **fail to converge** to an optimal solution.
    *   **Demonstration:** A deep learning model trained on unscaled 'Age' and 'Estimated Salary' data showed validation accuracy fluctuating between 40% and 60%, never exceeding 60% even after 100 epochs, which indicates poor performance and a lack of convergence.

**3. Why Unscaled Data Causes Problems (Geometric Intuition)**

*   **Backpropagation and Weight Updates:** During backpropagation, the algorithm updates the weights associated with each feature.
    *   The gradient of the loss with respect to a first-layer weight is proportional to the input value that weight multiplies. If one feature (e.g., Salary) has a much larger range than another (e.g., Age), the gradients, and therefore the weight updates, for the larger-scaled feature will be far larger.
    *   As a result, a single learning rate cannot suit both weights: updates are **dominated by the larger-scaled feature**, while the weights of the smaller-scaled feature change comparatively little and are effectively "ignored". This imbalance hinders effective learning.
*   **Cost Function Contour (Andrew Ng's Intuition):**
    *   **Unnormalised Data:** If input features are unnormalised (different scales), the **cost function's contour plot** (visualisation of loss) will be **non-symmetrical** and **elongated/unevenly stretched**.
        *   This shape forces the training process to **oscillate significantly** before potentially reaching the correct solution, making convergence slow and difficult.
    *   **Normalised Data:** If input features are normalised (same scale), the **cost function's contour plot** will be much more **symmetrical**, closer to **circular**.
        *   This symmetrical shape allows the optimisation algorithm (like Gradient Descent) to **travel directly and efficiently** towards the optimal solution, leading to faster and more stable convergence with fewer oscillations.
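
The contour argument above can be illustrated with a minimal sketch. It assumes a toy quadratic loss `L(w) = 0.5 * (c1 * w1**2 + c2 * w2**2)` rather than a real network; the curvature values, the step-size rule `1 / max(curvatures)`, and the helper name `gd_steps` are illustrative choices, not part of the original material. For a linear model with squared-error loss, the curvature along each weight grows with the mean squared value of the corresponding feature, which is why a large-scale feature stretches the contours.

```python
def gd_steps(curvatures, lr, tol=1e-6, max_steps=100_000):
    """Gradient descent on L(w) = 0.5 * sum(c_i * w_i**2).

    Returns the number of steps until every |w_i| < tol
    (or max_steps if that never happens).
    """
    w = [1.0 for _ in curvatures]  # hypothetical starting point
    for step in range(1, max_steps + 1):
        # The gradient of L with respect to w_i is c_i * w_i.
        w = [wi - lr * ci * wi for wi, ci in zip(w, curvatures)]
        if max(abs(wi) for wi in w) < tol:
            return step
    return max_steps


def stable_lr(curvatures):
    # Assumption: the steepest direction limits the step size,
    # so use 1 / (largest curvature) as a safely stable rate.
    return 1.0 / max(curvatures)


circular = [1.0, 1.0]      # features on the same scale
elongated = [1.0, 100.0]   # one direction 100x steeper than the other

steps_circular = gd_steps(circular, stable_lr(circular))
steps_elongated = gd_steps(elongated, stable_lr(elongated))

print("circular bowl: ", steps_circular, "steps")
print("elongated bowl:", steps_elongated, "steps")
```

With equal curvatures the loop finishes almost immediately. The elongated bowl needs far more steps, because the step size must stay small enough for the steep direction while progress along the shallow direction crawls.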

**4. Solution: Feature Scaling (Normalising Inputs)**

*   **Principle:** The solution is to bring all input features to the **same scale** or range.
*   **Methods of Feature Scaling:** Two primary methods are discussed:
    *   **a) Standardisation (Z-score Normalisation):**
        *   **Formula:** For each value `x`, `(x - mean) / standard_deviation`.
        *   **Effect:** **Centers the data around a mean of 0** and gives it a **standard deviation of 1**.
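        *   **Range:** The result is **not bounded**. For roughly normal data, about 68% of values fall between **-1 and +1** and nearly all fall between -3 and +3; outliers can lie further out.
        *   **Geometric Intuition:** Loosely, the data cloud is re-centred at the origin with unit spread along each feature axis (the "unit circle" picture).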
        *   **When to Use:**
            *   When the data follows a **normal distribution**.
            *   When the **minimum and maximum values of the feature are not known** or are subject to change.
    *   **b) Normalisation (Min-Max Scaling):**
        *   **Formula:** For each value `x`, `(x - minimum_value) / (maximum_value - minimum_value)`.
        *   **Effect:** Scales the data to a predefined fixed range, usually **0 to 1**.
        *   **Geometric Intuition:** Transforms the data into a "unit box".
        *   **When to Use:**
            *   When the **minimum and maximum values of the feature are known** and fixed (e.g., CGPA always 0-10).
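
Both formulas can be sketched in a few lines of plain Python. The sample `ages` and `salaries` values and the function names `standardise` and `min_max_scale` are made up for illustration. The sketch uses the population standard deviation (`statistics.pstdev`), which matches what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev


def standardise(values):
    """Z-score: (x - mean) / standard_deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-max: (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample data, not taken from the original dataset.
ages = [22, 35, 47, 58]
salaries = [20_000, 45_000, 90_000, 150_000]

z_ages = standardise(ages)
mm_salaries = min_max_scale(salaries)

print("standardised ages:", [round(z, 3) for z in z_ages])
print("min-max salaries: ", [round(s, 3) for s in mm_salaries])
```

Note that the standardised values have mean 0 and standard deviation 1 but are not confined to a fixed interval, whereas the min-max values always start at 0 and end at 1 for the data they were fitted on.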

**5. Implementation Example (using Scikit-learn's `StandardScaler`)**

*   **Steps:**
    1.  **Import:** Import `StandardScaler` from `sklearn.preprocessing`.
    2.  **Initialise:** Create an instance of `StandardScaler`.
    3.  **Transform Training Data:** Apply `fit_transform()` to your training input data (`X_train`). This calculates the mean and standard deviation from the training data and then applies the transformation.
    4.  **Transform Test Data:** Apply `transform()` to your test input data (`X_test`). It's crucial to use the mean and standard deviation learned from the *training* data to transform the test data, to prevent data leakage.
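*   **Result:** Each scaled feature has a mean of approximately 0 and a standard deviation of approximately 1 on the training set. In the example 'Age' and 'Estimated Salary' data, both features end up on a comparable scale, with most values lying close to the range -1 to 1 (standardised values are not strictly bounded).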
*   **Visual Confirmation:** A scatter plot of the scaled data will show the same distribution and relationships between features, but the axes (scales) will be different, confirming only the scale has changed, not the underlying data structure.
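
The four steps above might look like the following minimal sketch. The toy `X_train` and `X_test` arrays (columns: Age, Estimated Salary) are hypothetical stand-ins for the real dataset, which is not reproduced in these notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler  # step 1: import

# Hypothetical toy data; columns are [Age, EstimatedSalary].
X_train = np.array(
    [[22, 20_000], [35, 45_000], [47, 90_000], [58, 150_000]], dtype=float
)
X_test = np.array([[30, 60_000], [52, 120_000]], dtype=float)

scaler = StandardScaler()  # step 2: initialise

# Step 3: learn mean and standard deviation from the training data,
# then transform the training data with them.
X_train_scaled = scaler.fit_transform(X_train)

# Step 4: reuse the training statistics on the test data (no refitting).
X_test_scaled = scaler.transform(X_test)

print("learned means:", scaler.mean_)
print("learned stds: ", scaler.scale_)
print("scaled train column means:", X_train_scaled.mean(axis=0))
print("scaled train column stds: ", X_train_scaled.std(axis=0))
```

Calling `fit_transform()` on the test set instead would compute a separate mean and standard deviation from the test data, which is the leakage the steps warn against.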

**6. Improved Performance After Scaling**

*   **Convergence:** After scaling the input features (e.g., using StandardScaler), retraining the same neural network model for the same number of epochs (e.g., 100 epochs) demonstrates a dramatic improvement.
*   **Accuracy:** The validation accuracy can significantly increase (e.g., reaching 90% or more) and clearly shows an upward trend towards convergence.
*   **Reason:** With features on the same scale, the cost surface is more symmetrical, so the gradient descent algorithm can **traverse to the optimal solution more directly**, leading to faster and more stable training.

**7. Key Takeaways and Best Practices**

*   **Always Scale:** It is **highly recommended to always scale your data** when working with neural networks. In most cases, it offers significant benefits and rarely causes harm.
*   **Crucial Technique:** Data scaling is an **important component** for training neural networks effectively and improving their performance.
*   **When Not to Scale:** If your data is **already on the same scale**, then feature scaling is not necessary.

---