# Week 2: Regression with multiple input variables

## Multiple features

When we have more than one feature, we can rewrite our linear regression model as follows:

$$
\begin{equation}
\begin{aligned}
f_{w,b}(\mathbf{x}) &= w_1x_1 + w_2x_2 + \cdots + w_nx_n + b \\
f_{w,b}(\mathbf{x}) &= \mathbf{\vec w} \cdot \mathbf{\vec x} + b = \mathbf{w}^T\mathbf{x} + b
\end{aligned}
\label{eq: multiple_feat}
\end{equation}
$$

In the above equation, $\mathbf{x}$ and $\mathbf{w}$ are $n\times 1$ vectors of features and weights, respectively, and $b$ is a real number. We call this model **multiple linear regression**, NOT multivariate regression (which is something different).

In Python, we can use NumPy's vectorized function `numpy.dot()` to calculate $\mathbf{\vec w} \cdot \mathbf{\vec x}$:

```Python
import numpy as np

# w and x are 1D arrays of length n (weights and features); b is a scalar bias
f = np.dot(w, x) + b
```

## Vectorization

Here is a high-level comparison of the under-the-hood implementation with and without vectorization:

![vectorization](../images/vectorization.png)
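
As a rough sketch of the comparison above, the following computes the same prediction with an explicit loop and with `np.dot`. The values of `w`, `x`, and `b` are made up for illustration and do not come from the course data.

```python
import numpy as np

# Illustrative values only (assumed for this sketch)
w = np.array([1.0, 2.5, -3.3])
x = np.array([10.0, 20.0, 30.0])
b = 4.0

# Non-vectorized: accumulate w_j * x_j one term at a time
f_loop = 0.0
for j in range(w.shape[0]):
    f_loop += w[j] * x[j]
f_loop += b

# Vectorized: a single call computes the whole dot product
f_vec = np.dot(w, x) + b

print(f_loop, f_vec)
```

Both versions give the same number; the vectorized one hands the arithmetic to NumPy's compiled routines, which is usually much faster for large `n`.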

## Broadcasting in Python
Broadcasting in Python refers to a feature in libraries like **NumPy** that allows arithmetic operations between arrays of different shapes without creating duplicate copies of the data. Instead of requiring arrays to have identical dimensions for element-wise operations, broadcasting automatically "stretches" the smaller array to match the shape of the larger one, making operations more efficient in terms of both memory and speed.

### How Broadcasting Works
The core principle of broadcasting is a set of rules that determine if two arrays are "broadcastable." Two arrays are compatible for broadcasting if, for each dimension, their sizes are either:
* Equal, or
* One of them is 1.

The process aligns the arrays by prepending dimensions of size 1 to the shape of the array with fewer dimensions until both arrays have the same number of dimensions. Then, it checks the dimensions one by one, from the last dimension to the first.

Let's illustrate with an example: a $2 \times 3$ array and a 1D array of length 3.
* **Array A:** `[[1, 2, 3], [4, 5, 6]]` (Shape: $2 \times 3$)
* **Array B:** `[10, 20, 30]` (Shape: `(3,)`)

To perform an element-wise operation, NumPy adds a new dimension to Array B, making its shape $1 \times 3$. It then "broadcasts" the values of Array B across the rows of Array A.

The operation then becomes:
$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} + \begin{bmatrix} 10 & 20 & 30 \\ 10 & 20 & 30 \end{bmatrix} = \begin{bmatrix} 11 & 22 & 33 \\ 14 & 25 & 36 \end{bmatrix}$$
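
The same example can be checked in NumPy. This is a minimal sketch; `np.broadcast_to` is used only to make the "stretched" view of `B` visible.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
B = np.array([10, 20, 30])             # shape (3,)

# B is treated as shape (1, 3) and reused for both rows of A
C = A + B
print(C)

# A read-only view of B stretched to A's shape (no data is copied)
B_stretched = np.broadcast_to(B, A.shape)
print(B_stretched.shape)
```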

### Why Use Broadcasting?
Broadcasting offers significant advantages, especially in scientific computing and data analysis.

**Efficiency**

It avoids the need to create explicit large temporary arrays, which saves **memory** and is faster. For large datasets, this can be the difference between an operation that runs in seconds and one that crashes due to insufficient memory.

**Code Simplicity**

Broadcasting makes code cleaner and more readable. Instead of writing loops to apply an operation to each element, you can use a single, concise line of code. For example, to add a vector to each row of a matrix, you can simply write `matrix + vector` instead of iterating through each row.

### Performance
Because broadcasting is implemented in optimized, low-level C code within NumPy, it's often much faster than an equivalent operation written with Python loops.

### Limitations
While powerful, broadcasting has its own set of rules and limitations. If the arrays don't satisfy the broadcasting rules, NumPy will raise a `ValueError`. For example, trying to add a $2 \times 3$ array to a $1 \times 4$ array will fail because the last dimensions (3 and 4) are not equal and neither of them is 1. In such cases, you must reshape the arrays to make them compatible.
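
A small sketch of the failure case and of a reshape that fixes a related mismatch. The arrays are filled with placeholder values chosen for illustration.

```python
import numpy as np

a = np.ones((2, 3))
b = np.ones((1, 4))

# Shapes (2, 3) and (1, 4): the last dimensions 3 and 4 are incompatible
try:
    a + b
    failed = False
except ValueError as err:
    failed = True
    print("broadcast error:", err)

# A length-2 vector does not line up with the last dimension (3) either,
# but reshaping it to a column of shape (2, 1) makes it compatible
col = np.array([100.0, 200.0])
ok = a + col.reshape(2, 1)
print(ok)
```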

### Gradient descent for multiple linear regression

![multifeat multiple linear regression](../images/multifeat_lreg_gd.png)
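
The update rule can be sketched in vectorized form as below. This is a minimal illustration assuming the usual squared-error cost $J = \frac{1}{2m}\sum_i (f_{w,b}(\mathbf{x}^{(i)}) - y^{(i)})^2$; the function names, the toy data, and the values of `alpha` and `num_iters` are assumptions made for this example.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Gradients of the squared-error cost with respect to w and b."""
    m = X.shape[0]
    err = X @ w + b - y          # prediction error for every example, shape (m,)
    dj_dw = X.T @ err / m        # one partial derivative per feature, shape (n,)
    dj_db = err.sum() / m        # scalar
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    """Update w and b simultaneously for a fixed number of iterations."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Toy data generated from y = 2*x1 + 3*x2 + 1 (illustrative only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = X @ np.array([2.0, 3.0]) + 1.0

w, b = gradient_descent(X, y, np.zeros(2), 0.0, alpha=0.05, num_iters=5000)
print(w, b)
```

On this toy data the learned parameters approach the generating values; on real data the learning rate and iteration count would need tuning.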

### Feature scaling

Here is an intuition for how rescaling features with different ranges to comparable scales speeds up the convergence of gradient descent:

![feature scaling](../images/feat_scaling.png)

Typically, there are three common ways to scale features:
- Min-max scaling (normalizing to the feature range): subtract the minimum and divide by the range ($\frac{x - x_{min}}{x_{max} - x_{min}}$)
- Mean normalization: subtract the mean and divide by the range ($\frac{x-\mu}{x_{max} - x_{min}}$)
- Z-score (standard) normalization: subtract the mean and divide by the standard deviation ($\frac{x-\mu}{\sigma}$)
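
The scaling formulas listed above could be written per column as follows. This is a sketch with hypothetical function names and made-up data; it assumes no feature column is constant, since that would cause a division by zero.

```python
import numpy as np

def min_max_scale(X):
    """(x - x_min) / (x_max - x_min), column by column."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def mean_normalize(X):
    """(x - mu) / (x_max - x_min), column by column."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def zscore_normalize(X):
    """(x - mu) / sigma, column by column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Illustrative data: two features on very different scales
X = np.array([[2104.0, 5.0], [1416.0, 3.0], [852.0, 2.0]])

X_minmax = min_max_scale(X)     # each column now spans [0, 1]
X_mean = mean_normalize(X)      # each column is centered at 0
X_z = zscore_normalize(X)       # each column has mean 0 and std 1
```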


**Rule of thumb for feature scaling:**

![feature scaling rule](../images/feat_scaling_rule.png)

### Choosing learning rate

![learning rate](../images/learning_rate.png)


**Implementation Note:** When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations. After learning the parameters from the model, we often want to predict the prices of houses we have not seen before. Given a new x value (living room area and number of bedrooms), we must <u>first normalize x using the mean and standard deviation that we had previously computed from the training set</u>.
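
A minimal sketch of this note, using made-up housing numbers: the statistics are computed once from the training set, stored, and reused for any new example.

```python
import numpy as np

# Illustrative training set: columns are area and number of bedrooms
X_train = np.array([
    [2104.0, 3.0],
    [1600.0, 3.0],
    [2400.0, 3.0],
    [1416.0, 2.0],
    [3000.0, 4.0],
])

# Compute and keep the normalization statistics from the training set
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_norm = (X_train - mu) / sigma

# A new, unseen example must be normalized with the SAME mu and sigma
x_new = np.array([1650.0, 3.0])
x_new_norm = (x_new - mu) / sigma

# With parameters w, b learned on X_norm, the prediction would be:
# price = np.dot(w, x_new_norm) + b
```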

**Plotting data:**
- With multiple features, we can no longer have a single plot showing results versus features.
- The plot was generated using the normalized features. Any input used for prediction with parameters learned from a normalized training set must be normalized in the same way.

### Feature Engineering

Using intuition to design new features by transforming or combining the original features.
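
For instance, if a dataset had a lot's frontage and depth as features, their product (the area) might be a more predictive feature than either one alone. The sketch below uses hypothetical columns and values to show how such a feature could be added.

```python
import numpy as np

# Hypothetical lot data: columns are frontage and depth
X = np.array([[20.0, 30.0], [15.0, 40.0], [25.0, 25.0]])

# Engineered feature: area = frontage * depth
area = X[:, 0] * X[:, 1]

# Append the new feature as a third column
X_eng = np.c_[X, area]
print(X_eng)
```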

### Polynomial regression

![polynomial regression](../images/polynomial_reg.png)

Feature scaling is very important when we use higher-order features, because the feature scales must be comparable for gradient descent to work well.
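
To see why scaling matters here, the sketch below builds the features $x$, $x^2$, and $x^3$ from a single made-up input and compares their ranges before and after z-score normalization.

```python
import numpy as np

# Illustrative single input feature
x = np.arange(0, 20, 1.0)

# Polynomial features: x, x^2, x^3 as columns
X_poly = np.c_[x, x**2, x**3]
raw_range = X_poly.max(axis=0) - X_poly.min(axis=0)

# Z-score normalization brings the columns to comparable scales
mu, sigma = X_poly.mean(axis=0), X_poly.std(axis=0)
X_poly_norm = (X_poly - mu) / sigma
norm_range = X_poly_norm.max(axis=0) - X_poly_norm.min(axis=0)

print("range before scaling:", raw_range)
print("range after scaling: ", norm_range)
```

Before scaling, the cubic column spans thousands while the linear column spans less than twenty; after scaling, all three columns have ranges of a few units.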

We need to choose which features to use or engineer to get a better model (we will see this in Course 2).

**Importance of certain features:**
- Gradient descent picks the 'correct' features for us by emphasizing their associated parameters.
- A smaller weight implies a less important feature (when the features are on comparable scales); in the extreme, when a weight is zero or very close to zero, the associated feature is not useful in fitting the model to the data.

**A note on NumPy (`numpy.c_`):**

- `numpy.c_` is a convenient way to concatenate arrays along the second axis (column-wise).
- It's a special object in NumPy that acts as a shorthand for column stacking and is often more readable than using `np.concatenate` or `np.stack`.
- The `c_` object takes a sequence of array-like objects and stacks them as columns in a 2D array. The input arrays are automatically converted into 2D if they are 1D.
- Note that `np.c_` is not a function; it is an instance of a helper class (`numpy.lib.index_tricks.CClass` in NumPy 1.x). When you use bracket notation (`[]`), you are not calling a function but invoking the `__getitem__` method of this object. For example, for 1D arrays `a`, `b`, and `c`, `np.c_[a, b, c]` is a more compact way of writing:


```Python
np.concatenate([a[..., np.newaxis], b[..., np.newaxis], c[..., np.newaxis]], axis=1)
```
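
A quick check of this equivalence for 1D inputs, with small made-up arrays. `np.column_stack` is shown as another standard way to get the same result.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([7, 8, 9])

# Each 1D array becomes one column of the result
stacked = np.c_[a, b, c]

# The longer spelling from above
same = np.concatenate(
    [a[..., np.newaxis], b[..., np.newaxis], c[..., np.newaxis]], axis=1
)

# np.column_stack also stacks 1D arrays as columns
also_same = np.column_stack((a, b, c))

print(stacked)
```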

**Difference between `@` operator and numpy.dot() function:**

The primary difference is that `@` is an **operator** for matrix multiplication, while `numpy.dot()` is a **function**. They produce the same result for 1D and 2D arrays, but they behave differently in other cases (scalars and arrays with more than two dimensions) and have distinct purposes.

#### `@` Operator (Python 3.5+)

* `@` is a dedicated **matrix multiplication operator**. It was introduced in Python 3.5 to provide a more intuitive and readable syntax for this operation.
* **Behavior:** It performs matrix multiplication on 2D arrays and the inner (dot) product on 1D arrays. It works with any object that implements matrix multiplication (the `__matmul__` method), such as NumPy arrays.
* **Broadcasting:** The `@` operator does **not** broadcast element-wise the way `*` or `+` do. The last two dimensions must be compatible for matrix multiplication (i.e., the number of columns of the first array must match the number of rows of the second). For arrays with more than two dimensions, only the leading "batch" dimensions are broadcast. Scalar operands are not allowed.

#### `numpy.dot()` function

* `numpy.dot()` is a more general-purpose **function** that calculates the dot product of two arrays.
* **Behavior:**
    * For 2D arrays, `np.dot(a, b)` performs matrix multiplication.
    * For 1D arrays, it computes the inner product (the sum of the products of corresponding elements).
    * For a scalar and an array, it performs element-wise multiplication.
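    * For higher-dimensional arrays, it sums over the last axis of the first array and the second-to-last axis of the second, which is not the same as batched matrix multiplication.
* **Broadcasting:** Unlike the `@` operator, `np.dot()` does not broadcast leading dimensions, so the two give different result shapes for arrays with more than two dimensions.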

#### Practical Differences and Recommendations

| Feature | `@` Operator | `numpy.dot()` Function |
| :--- | :--- | :--- |
| **Readability** | High: Clear intention for matrix multiplication. | Lower: Can be ambiguous depending on context. |
| **Flexibility** | Less flexible: Matrix multiplication only; scalars are rejected. | More flexible: Can do inner products, scalar-array products, and matrix multiplication. |
| **Speed** | Essentially the same for 1D and 2D arrays, since both typically rely on the same optimized linear algebra routines. | Essentially the same for 1D and 2D arrays; any difference is usually negligible and worth measuring before relying on it. |

**Recommendation:**
* Use the **`@` operator** for **matrix multiplication of 2D arrays**. It's the most readable and modern approach for this specific task.
* Use **`numpy.dot()`** for **dot products of 1D vectors** or when you need its more general behavior for higher-dimensional arrays.
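
The differences described above can be checked with a short sketch. The arrays are arbitrary placeholders chosen only to make the shapes easy to follow.

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
v = np.array([1.0, 2.0, 3.0])

# 2D and 1D inputs: both give the same result
same_2d = np.allclose(A @ B, np.dot(A, B))
same_1d = np.isclose(v @ v, np.dot(v, v))

# Scalars: np.dot accepts them, @ does not
scaled = np.dot(2.0, v)
try:
    2.0 @ v
    scalar_ok = True
except (TypeError, ValueError):
    scalar_ok = False

# More than two dimensions: the result shapes differ
X = np.ones((5, 2, 3))
Y = np.ones((5, 3, 4))
matmul_shape = (X @ Y).shape        # batched: one (2, 4) product per leading index
dot_shape = np.dot(X, Y).shape      # every matrix in X combined with every matrix in Y

print(same_2d, same_1d, scalar_ok, matmul_shape, dot_shape)
```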