# Machine Learning Master Notes 10 - Feature Scaling

### Prepare Environment

In [1]:
%matplotlib inline
import math
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from matplotlib import cm

# SciKit Learn Regression Model
from sklearn import linear_model
from sklearn.linear_model import LinearRegression

# SciKit Learn Pre-processing and Feature Scaling
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer

# The following file contains the finalized gradient descent and cost function code
import MyRegressionProgramV1 as my

## Multiple Linear Regression: Cost Function and Gradient Descent

Hypothesis: $$f_{\vec{w},b}(\vec{X}^{(i)})=b + \sum\limits_{j=0}^{n-1} \vec{w}_{j}\vec{X}_{j}^{(i)}$$


Cost Function:	$$J(\vec w, b) = \frac{1}{2m}   \sum\limits_{i=0}^{m-1} (f_{\vec w,b}(\vec{X}^{(i)})-\vec y^{(i)})^{2}$$ 
$$J(\vec w, b) = \frac{1}{2m} \sum\limits_{i=0}^{m-1} \left(\left(b + \sum\limits_{j=0}^{n-1} \vec w_{j} \vec X_{j}^{(i)} \right)-\vec y^{(i)}\right)^{2}$$
$$J(\vec w, b) = \frac{1}{2m} \sum\limits_{i=0}^{m-1} \left(\left(b + \vec X^{(i)} \cdot \vec w \right)-\vec y^{(i)}\right)^{2}$$
$$$$
Gradient Descent Algorithm: $$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline
\;  \vec w &= \vec w -  \alpha \frac{\partial J(\vec{w},b)}{\partial \vec{w}}  \; \newline 
 b &= b -  \alpha \frac{\partial J(\vec{w},b)}{\partial b}  \newline \rbrace
\end{align*}$$


Partial Derivatives: $$
\begin{align}
\frac{\partial J(\vec{w},b)}{\partial \vec{w}}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\vec{w},b}(\vec{X}^{(i)}) - y^{(i)})\vec{X}^{(i)} \\
  \frac{\partial J(\vec{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\vec{w},b}(\vec{X}^{(i)}) - y^{(i)}) \\
\end{align}
$$

Full Implementation of Gradient Descent:
$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline
\;  \vec{w} &= \vec{w} -  \alpha \frac{1}{m} \sum\limits_{i = 0}^{m-1} ((\vec{X}^{(i)}\cdot \vec{w} + b) - y^{(i)})\vec{X}^{(i)}  \; \newline 
 b &= b -  \alpha \frac{1}{m} \sum\limits_{i = 0}^{m-1} ((\vec{X}^{(i)}\cdot \vec{w} + b) - y^{(i)})  \newline \rbrace
\end{align*}$$
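
The update rules above translate almost line for line into NumPy. The following is a minimal sketch and not the implementation in `MyRegressionProgramV1`, which is not shown in these notes. The function name `gradient_descent_sketch`, the defaults `alpha=0.1` and `iterations=5000`, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent_sketch(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for linear regression (illustrative sketch).

    X has shape (m, n), y has shape (m, 1).
    Returns the weights w with shape (n, 1) and the intercept b.
    """
    m, n = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    for _ in range(iterations):
        error = X @ w + b - y      # f(X) - y for every example, shape (m, 1)
        dw = (X.T @ error) / m     # partial derivative of J with respect to w
        db = error.mean()          # partial derivative of J with respect to b
        w = w - alpha * dw         # simultaneous update of w and b
        b = b - alpha * db
    return w, b

# Toy data generated from y = 2x + 1 (assumed, not from the notes)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * X_demo + 1.0

w_demo, b_demo = gradient_descent_sketch(X_demo, y_demo)
print(w_demo, b_demo)  # w_demo is close to 2 and b_demo is close to 1
```

Both gradients are computed from the same `error` before either parameter changes, which matches the simultaneous update in the formulas.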

## Feature Scaling

### Why Do We Need Feature Scaling

Suppose we use linear regression to predict housing prices from two features: floor area, which ranges from 800 to 2000 square feet, and number of rooms, which ranges from 1 to 5. The scales of these two features differ by a factor of several hundred.

With features on such different scales, gradient descent converges slowly: a learning rate small enough to keep the large-scale feature stable makes only tiny updates to the weights of the small-scale features. Gradient descent converges much faster when all features are on a similar scale.

In an earlier example we also ran into Python overflow errors because the numbers involved became very large. Feature scaling keeps the numbers in a manageable range.

The idea of feature scaling is to change the scale of each feature so that all features fall in a similar range. For example, we can divide the floor area by 2000 and the number of rooms by 5, which brings both features into a range with a maximum of 1.
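
One way to put a number on this effect, offered as an illustrative sketch: for the squared-error cost, gradient descent slows down as the condition number of the Hessian $\frac{1}{m}A^TA$ grows, where $A$ is the feature matrix with an extra column of ones for the intercept. The floor areas and room counts below are synthetic values drawn from the ranges mentioned above, and `hessian_condition` is a hypothetical helper, not part of these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
sqft = rng.uniform(800, 2000, size=m)              # made-up floor areas
rooms = rng.integers(1, 6, size=m).astype(float)   # made-up room counts, 1 to 5

def hessian_condition(features):
    """Condition number of (1/m) A^T A, where A = [1, features]."""
    A = np.column_stack([np.ones(len(features)), features])
    H = A.T @ A / len(features)
    return np.linalg.cond(H)

X_raw = np.column_stack([sqft, rooms])
X_scaled = np.column_stack([sqft / 2000, rooms / 5])  # divide by 2000 and by 5

cond_raw = hessian_condition(X_raw)
cond_scaled = hessian_condition(X_scaled)
print(f"condition number, raw features:    {cond_raw:.3e}")
print(f"condition number, scaled features: {cond_scaled:.3e}")
```

A smaller condition number means the cost surface is less elongated, so a single learning rate suits all weights better.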

One of the main feature scaling techniques is to rescale every feature to a value between 0 and 1. This technique is called normalization. A common normalization method is MinMax scaling.

Another feature scaling technique is to transform each feature so that its mean is 0 and its standard deviation is 1. This technique is known as standardization, and it is often recommended for data that is approximately Gaussian. The common method for standardization is z-score scaling.

In summary, feature scaling provides the following advantages:

- **Feature scaling allows gradient descent to converge faster, which shortens training time.**
- **Feature scaling can reduce the influence of extreme values on training. Note that linear methods such as MinMax and z-score scaling do not change the shape of the distribution, so outliers remain outliers.**
- **Feature scaling also helps to balance the impact of large-scale features against small-scale features.**


**Additional Reference**

- https://medium.com/@shivanipickl/what-is-feature-scaling-and-why-does-machine-learning-need-it-104eedebb1c9
- https://www.baeldung.com/cs/normalization-vs-standardization
- https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
- https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35
- https://www.blog.trainindata.com/feature-scaling-in-machine-learning/

### Illustration of Feature Scaling with Training Example

The following training dataset contains three examples with four features (size, bedrooms, floors, and age) shown in the table below.

| Size (sqft) | Number of Bedrooms  | Number of floors | Age of  Home | Price (1000s dollars)  |   
| ----------------| ------------------- |----------------- |--------------|-------------- |  
| 2104            | 5                   | 1                | 45           | 460           |  
| 1416            | 3                   | 2                | 40           | 232           |  
| 852             | 2                   | 1                | 35           | 178           |  



In [2]:
X_train1 = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]]).reshape((3,4))
y_train1 = np.array([460, 232, 178]).reshape((3,1))

In [3]:
w, b, cost_history, w_history, b_history = my.compute_gradient_descent(X_train1, y_train1)

  lossFunction = (fx - y) ** 2
  fx = (X@w)+b
  fx = ((X@w) + b)
  b = b - (alpha * db)
  w = w - (alpha * dw)


iteration 9999: Last cost = nan: intercept = nan: weights = [[nan nan nan nan]]
best w [[nan]
 [nan]
 [nan]
 [nan]]
best b nan


**In this case the square-foot values are too large for the computation, so the cost and the weights end up as `nan`.**

Let us re-scale the data by dividing the house size by 1000 and the age of the house by 10.

In [4]:
X_scaled1 = X_train1 * np.array([0.001,1,1,0.1]).reshape((1,4))
X_scaled1

array([[2.104, 5.   , 1.   , 4.5  ],
       [1.416, 3.   , 2.   , 4.   ],
       [0.852, 2.   , 1.   , 3.5  ]])

In [5]:
w, b, cost_history, w_history, b_history = my.compute_gradient_descent(X_scaled1, y_train1)

iteration 9999: Last cost = 5.8545e-11: intercept = -1.9716e+00: weights = [[ 26.30148126  78.60186731 -46.06835442  13.26495614]]
best w [[ 26.3015]
 [ 78.6019]
 [-46.0684]
 [ 13.265 ]]
best b -1.9716
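
The contrast between the two runs above can be reproduced with a few iterations of a hand-written loop. This is a sketch under assumptions: the learning rate used inside `my.compute_gradient_descent` is not shown in these notes, so `alpha = 0.01` is an assumed value, and `cost_trace` is a hypothetical helper written only for this illustration.

```python
import numpy as np

X_raw = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]], dtype=float)
y = np.array([[460.0], [232.0], [178.0]])

def cost_trace(X, y, alpha, iterations):
    """Run batch gradient descent from zero weights and record the cost J before each update."""
    m, n = X.shape
    w, b = np.zeros((n, 1)), 0.0
    costs = []
    for _ in range(iterations):
        error = X @ w + b - y
        costs.append(float((error ** 2).sum() / (2 * m)))
        w = w - alpha * (X.T @ error) / m
        b = b - alpha * error.mean()
    return costs

alpha = 0.01  # assumed learning rate, not taken from MyRegressionProgramV1

# Same rescaling as above: size divided by 1000, age divided by 10
X_scaled = X_raw * np.array([0.001, 1, 1, 0.1])

costs_raw = cost_trace(X_raw, y, alpha, iterations=5)
costs_scaled = cost_trace(X_scaled, y, alpha, iterations=5)
print("raw   :", costs_raw)     # the cost grows at every step
print("scaled:", costs_scaled)  # the cost shrinks at every step
```

With the raw features the cost grows by many orders of magnitude per step, which is how a long run eventually overflows to `nan`.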


## Methods of Feature Scaling

### Method 1: Max Scaling (Divide by Max Value in the Range)

There are multiple ways to rescale the `x` features. The first method is to divide `x` by the maximum value of `x`. The formula is as follows:

$$x_{(scaled)} = \frac{x}{max(x)}$$

Example:

Using the example above:

If $x_1$ is the floor area of a house in square feet, it ranges over $852 \le x_1 \le 2104$.

We then divide $x_1$ by $2104$. After the conversion, the range is as follows:

$$0.405 \le x_{1\_scaled} \le 1$$

If $x_2$ is the number of bedrooms in a house, it ranges over $2 \le x_2 \le 5$. Dividing by $5$ gives:

$$0.4 \le x_{2\_scaled} \le 1$$

In the examples above, we re-scale the training data to a similar range so that gradient descent converges faster.

In [6]:
x_1 = X_train1[:,0]
x_1

array([2104, 1416,  852])

In [7]:
x_2 = X_train1[:,1]
x_2

array([5, 3, 2])

In [8]:
x_1_scaled = x_1/max(x_1)
x_1_scaled

array([1.        , 0.6730038 , 0.40494297])

In [9]:
x_2_scaled = x_2/max(x_2)
x_2_scaled

array([1. , 0.6, 0.4])

In [10]:
X_train1

array([[2104,    5,    1,   45],
       [1416,    3,    2,   40],
       [ 852,    2,    1,   35]])

In [11]:
np.amax(X_train1, axis=0)

array([2104,    5,    2,   45])

In [12]:
X_scaled1 = X_train1/np.amax(X_train1, axis=0)
X_scaled1

array([[1.        , 1.        , 0.5       , 1.        ],
       [0.6730038 , 0.6       , 1.        , 0.88888889],
       [0.40494297, 0.4       , 0.5       , 0.77777778]])

The entire array is re-scaled to values between 0.4 and 1.

| Size (sqft) | Number of Bedrooms  | Number of floors | Age of  Home | Price (1000s dollars)  |   
| ----------------| ------------------- |----------------- |--------------|-------------- |  
| 1               | 1                   | 0.5              | 1            | 460           |  
| 0.6730038       | 0.6                 | 1                | 0.88888889   | 232           |  
| 0.40494297      | 0.4                 | 0.5              | 0.77777778   | 178           |  

In [13]:
def max_scaler(X):
    """
    Max scaler: divides each column by its maximum value.
    Formula: x(scaled) = x / max(x)
    For non-negative data this gives the same result as sklearn's normalize(X, norm='max', axis=0)
    """
    maxValue = np.amax(X, axis=0)
    scaled = X/maxValue
    return scaled

In [14]:
# Our function
print(max_scaler(X_train1))

[[1.         1.         0.5        1.        ]
 [0.6730038  0.6        1.         0.88888889]
 [0.40494297 0.4        0.5        0.77777778]]


In [15]:
# Similar scaling function in SciKit Learn
# Please note that we need to import preprocessing
from sklearn import preprocessing
print(preprocessing.normalize(X_train1, norm='max', axis=0))

[[1.         1.         0.5        1.        ]
 [0.6730038  0.6        1.         0.88888889]
 [0.40494297 0.4        0.5        0.77777778]]


#### Example 2

In [16]:
x_array2 = np.array([2,3,5,6,7,4,8,7,6]).reshape(-1,1)

In [17]:
# Our function
print(max_scaler(x_array2))

[[0.25 ]
 [0.375]
 [0.625]
 [0.75 ]
 [0.875]
 [0.5  ]
 [1.   ]
 [0.875]
 [0.75 ]]


In [18]:
# Similar scaling function in SciKit Learn
print(preprocessing.normalize(x_array2, norm='max', axis=0))

[[0.25 ]
 [0.375]
 [0.625]
 [0.75 ]
 [0.875]
 [0.5  ]
 [1.   ]
 [0.875]
 [0.75 ]]


#### Example 3

In [19]:
X_array3 = np.array([[2000,3,30,0.5], [1000,4,45,0.4], [1500,1,50,0.3]])
X_array3

array([[2.0e+03, 3.0e+00, 3.0e+01, 5.0e-01],
       [1.0e+03, 4.0e+00, 4.5e+01, 4.0e-01],
       [1.5e+03, 1.0e+00, 5.0e+01, 3.0e-01]])

In [20]:
# Our function
print(max_scaler(X_array3))

[[1.   0.75 0.6  1.  ]
 [0.5  1.   0.9  0.8 ]
 [0.75 0.25 1.   0.6 ]]


In [21]:
# Similar scaling function in SciKit Learn
print(preprocessing.normalize(X_array3, norm='max', axis=0))

[[1.   0.75 0.6  1.  ]
 [0.5  1.   0.9  0.8 ]
 [0.75 0.25 1.   0.6 ]]
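
`preprocessing.normalize` is a plain function, so it does not remember the training maxima for later use on new data. As a sketch of one alternative, scikit-learn's `MaxAbsScaler` divides each column by its maximum absolute value, which matches `max_scaler` when all values are non-negative, and it can be fitted once and reused. The new house in `X_new` is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]], dtype=float)

scaler = MaxAbsScaler().fit(X_train)   # learns the per-column maximum absolute value
X_scaled = scaler.transform(X_train)

# A hypothetical new house, scaled with the maxima learned from the training data
X_new = np.array([[1200, 3, 1, 20]], dtype=float)
X_new_scaled = scaler.transform(X_new)

print(scaler.max_abs_)   # per-column maxima learned during fit
print(X_scaled)
print(X_new_scaled)
```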


### Method 2: Mean Normalization

For Mean Normalization, the formula is

$$x_{scaled} = \frac{x - \mu}{max(x)-min(x)}$$

Using mean normalization, the scaled data is centered on zero and takes both negative and positive values. In our training example, the scaled values range from $-0.5$ to about $0.67$.

In [22]:
def mean_norm(X):
    """
    Mean normalizer.
    Formula: x(scaled) = (x - mean(x)) / (max(x) - min(x))
    scikit-learn has no direct equivalent of this scaler
    """
    big = X.max(axis=0)
    small = X.min(axis=0)
    norm_range = big - small
    avg = X.mean(axis=0)
    scaled = (X - avg) / norm_range
    return scaled, avg, norm_range

#### Example 1

In [23]:
# Our function
X_norm1, _, _ = mean_norm(X_train1)
print(X_norm1)
# There is no similar scaling function in SciKit Learn

[[ 0.51650692  0.55555556 -0.33333333  0.5       ]
 [-0.03301384 -0.11111111  0.66666667  0.        ]
 [-0.48349308 -0.44444444 -0.33333333 -0.5       ]]


#### Example 2

In [24]:
x_array2 = np.array([2,3,5,6,7,4,8,7,6]).reshape(-1,1)

# Our function
x_norm2, _, _ = mean_norm(x_array2)
print(x_norm2)
# There is no similar scaling function in SciKit Learn

[[-0.55555556]
 [-0.38888889]
 [-0.05555556]
 [ 0.11111111]
 [ 0.27777778]
 [-0.22222222]
 [ 0.44444444]
 [ 0.27777778]
 [ 0.11111111]]


#### Example 3

In [25]:
X_array3 = np.array([[2000,3,30,0.5], [1000,4,45,0.4], [1500,1,50,0.3]])
# Our function
X_norm3, _, _ = mean_norm(X_array3)
print(X_norm3)
# There is no similar scaling function in SciKit Learn

[[ 5.00000000e-01  1.11111111e-01 -5.83333333e-01  5.00000000e-01]
 [-5.00000000e-01  4.44444444e-01  1.66666667e-01  2.77555756e-16]
 [ 0.00000000e+00 -5.55555556e-01  4.16666667e-01 -5.00000000e-01]]
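
`mean_norm` returns the mean and the range alongside the scaled data, which suggests reusing them later. The following sketch shows one way to do so: applying the training statistics to a new example and mapping scaled values back to the original units. The helpers `apply_mean_norm` and `invert_mean_norm` and the house in `X_new` are hypothetical and are not part of the original notes.

```python
import numpy as np

X_train = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]], dtype=float)

# Statistics learned from the training data (the same quantities mean_norm returns)
avg = X_train.mean(axis=0)
norm_range = X_train.max(axis=0) - X_train.min(axis=0)

def apply_mean_norm(X, avg, norm_range):
    """Scale X with previously computed training statistics."""
    return (X - avg) / norm_range

def invert_mean_norm(X_scaled, avg, norm_range):
    """Map scaled values back to the original units."""
    return X_scaled * norm_range + avg

X_new = np.array([[1200, 3, 1, 20]], dtype=float)   # made-up house
X_new_scaled = apply_mean_norm(X_new, avg, norm_range)
X_new_restored = invert_mean_norm(X_new_scaled, avg, norm_range)

print(X_new_scaled)
print(X_new_restored)   # recovers X_new up to floating-point rounding
```

A new value that lies outside the training range, such as an age of 20 here, falls outside the interval seen for the training data. That is expected.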


### Method 3: Z-Score Normalization (Standardization)

For Z-Score Normalization, the formula is

$$x_{scaled} = \frac{x - \mu}{\sigma}$$


<div class="alert alert-block alert-warning"><b>
Please note that Numpy and Pandas use different definitions of standard deviation (sample vs. population standard deviation). The result may be slightly different. More will be explained below.
</b></div>

In [26]:
def std_norm_v1(X):
    """
    Z-score normalizer.
    Formula: x(scaled) = (x - mean(x)) / std(x)
    The equivalent scaler in sklearn is StandardScaler
    """
    
    avg = X.mean(axis=0)

    std = X.std(axis=0)
    
    X_norm = (X-avg)/std
    
    return X_norm, avg, std

In [27]:
# Our function
X_norm1, _, _ = std_norm_v1(X_train1)
print(X_norm1)

[[ 1.26311506  1.33630621 -0.70710678  1.22474487]
 [-0.08073519 -0.26726124  1.41421356  0.        ]
 [-1.18237987 -1.06904497 -0.70710678 -1.22474487]]


In [28]:
# Similar scaling function in SciKit Learn
std_scaler = StandardScaler().fit(X_train1)
normalized_arr = std_scaler.transform(X_train1)
print(normalized_arr)

[[ 1.26311506  1.33630621 -0.70710678  1.22474487]
 [-0.08073519 -0.26726124  1.41421356  0.        ]
 [-1.18237987 -1.06904497 -0.70710678 -1.22474487]]


#### Example 2

In [29]:
x_array2 = np.array([2,3,5,6,7,4,8,7,6]).reshape(-1,1)

In [30]:
# Our function
x_norm2, _, _ = std_norm_v1(x_array2)
print(x_norm2)

[[-1.76776695]
 [-1.23743687]
 [-0.1767767 ]
 [ 0.35355339]
 [ 0.88388348]
 [-0.70710678]
 [ 1.41421356]
 [ 0.88388348]
 [ 0.35355339]]


In [31]:
# Similar scaling function in SciKit Learn
# Fit and transform the raw data (x_array2), not the already-normalized x_norm2
std_scaler = StandardScaler().fit(x_array2)
normalized_arr = std_scaler.transform(x_array2)
print(normalized_arr)

[[-1.76776695]
 [-1.23743687]
 [-0.1767767 ]
 [ 0.35355339]
 [ 0.88388348]
 [-0.70710678]
 [ 1.41421356]
 [ 0.88388348]
 [ 0.35355339]]


#### Example 3

In [32]:
X_array3 = np.array([[2000,3,30,0.5], [1000,4,45,0.4], [1500,1,50,0.3]])
X_array3

array([[2.0e+03, 3.0e+00, 3.0e+01, 5.0e-01],
       [1.0e+03, 4.0e+00, 4.5e+01, 4.0e-01],
       [1.5e+03, 1.0e+00, 5.0e+01, 3.0e-01]])

In [33]:
# Our function
X_norm3, _, _ = std_norm_v1(X_array3)
print(X_norm3)

[[ 1.22474487e+00  2.67261242e-01 -1.37281295e+00  1.22474487e+00]
 [-1.22474487e+00  1.06904497e+00  3.92232270e-01  6.79869978e-16]
 [ 0.00000000e+00 -1.33630621e+00  9.80580676e-01 -1.22474487e+00]]


In [34]:
# Similar scaling function in SciKit Learn
std_scaler = StandardScaler().fit(X_array3)
normalized_arr = std_scaler.transform(X_array3)
print(normalized_arr)

[[ 1.22474487e+00  2.67261242e-01 -1.37281295e+00  1.22474487e+00]
 [-1.22474487e+00  1.06904497e+00  3.92232270e-01  6.79869978e-16]
 [ 0.00000000e+00 -1.33630621e+00  9.80580676e-01 -1.22474487e+00]]


**The above confirms that our function and the SciKit Learn function produce the same results.**

## SciKit Learn Feature Scaling

SciKit Learn provides a list of scaling methods:
- StandardScaler
- MinMaxScaler
- MaxAbsScaler
- RobustScaler
- PowerTransformer
- QuantileTransformer
- Normalizer

We will only be looking at:
- StandardScaler
- MinMaxScaler

### StandardScaler (Same as Z-Score Normalization - Standardization Technique)


In SciKit Learn, StandardScaler uses the formula below, which is the same as our Z-Score Normalization formula:

$$x_{scaled} = \frac{x - \mu}{\sigma}$$
$$ $$
The effect of this scaling technique is to have a **mean of 0** and **standard deviation of 1**.


In [35]:
# Our function
X_norm1, _, _ = std_norm_v1(X_train1)
print(X_norm1)

[[ 1.26311506  1.33630621 -0.70710678  1.22474487]
 [-0.08073519 -0.26726124  1.41421356  0.        ]
 [-1.18237987 -1.06904497 -0.70710678 -1.22474487]]


In [36]:
# Similar scaling function in SciKit Learn
std_scaler = StandardScaler().fit(X_train1)
normalized_arr = std_scaler.transform(X_train1)
print(normalized_arr)

[[ 1.26311506  1.33630621 -0.70710678  1.22474487]
 [-0.08073519 -0.26726124  1.41421356  0.        ]
 [-1.18237987 -1.06904497 -0.70710678 -1.22474487]]


#### Example 2

In [37]:
x_array2 = np.array([2,3,5,6,7,4,8,7,6]).reshape(-1,1)

# StandardScaler in SciKit Learn
std_scaler = StandardScaler().fit(x_array2)
normalized_arr = std_scaler.transform(x_array2)
print(normalized_arr)

[[-1.76776695]
 [-1.23743687]
 [-0.1767767 ]
 [ 0.35355339]
 [ 0.88388348]
 [-0.70710678]
 [ 1.41421356]
 [ 0.88388348]
 [ 0.35355339]]


In [38]:
print('scaler mean',std_scaler.mean_)
print('scaler std deviation',std_scaler.scale_)

scaler mean [5.33333333]
scaler std deviation [1.88561808]


In [39]:
# Our function that is same as StandardScaler
x_norm2, avg, stddev = std_norm_v1(x_array2)
print(x_norm2)

[[-1.76776695]
 [-1.23743687]
 [-0.1767767 ]
 [ 0.35355339]
 [ 0.88388348]
 [-0.70710678]
 [ 1.41421356]
 [ 0.88388348]
 [ 0.35355339]]


In [40]:
print('our function mean',avg)
print('our function std deviation',stddev)

our function mean [5.33333333]
our function std deviation [1.88561808]


#### Example 3

In [41]:
X_array3 = np.array([[2000,3,30,0.5], [1000,4,45,0.4], [1500,1,50,0.3]])
X_array3

array([[2.0e+03, 3.0e+00, 3.0e+01, 5.0e-01],
       [1.0e+03, 4.0e+00, 4.5e+01, 4.0e-01],
       [1.5e+03, 1.0e+00, 5.0e+01, 3.0e-01]])

In [42]:
# StandardScaler in SciKit Learn
std_scaler = StandardScaler().fit(X_array3)
normalized_arr = std_scaler.transform(X_array3)
print(normalized_arr)

[[ 1.22474487e+00  2.67261242e-01 -1.37281295e+00  1.22474487e+00]
 [-1.22474487e+00  1.06904497e+00  3.92232270e-01  6.79869978e-16]
 [ 0.00000000e+00 -1.33630621e+00  9.80580676e-01 -1.22474487e+00]]


In [43]:
print('scaler mean',std_scaler.mean_)
print('scaler std deviation',std_scaler.scale_)

scaler mean [1.50000000e+03 2.66666667e+00 4.16666667e+01 4.00000000e-01]
scaler std deviation [4.08248290e+02 1.24721913e+00 8.49836586e+00 8.16496581e-02]


In [44]:
# Our function that is same as StandardScaler
X_norm3, avg, stddev = std_norm_v1(X_array3)
print(X_norm3)

[[ 1.22474487e+00  2.67261242e-01 -1.37281295e+00  1.22474487e+00]
 [-1.22474487e+00  1.06904497e+00  3.92232270e-01  6.79869978e-16]
 [ 0.00000000e+00 -1.33630621e+00  9.80580676e-01 -1.22474487e+00]]


In [45]:
print('our function mean',avg)
print('our function std deviation',stddev)

our function mean [1.50000000e+03 2.66666667e+00 4.16666667e+01 4.00000000e-01]
our function std deviation [4.08248290e+02 1.24721913e+00 8.49836586e+00 8.16496581e-02]


### MinMaxScaler (Normalization Technique)

In SciKit Learn, MinMaxScaler uses the formula below:

$$x_{scaled} = \frac{x - min(x)}{max(x)-min(x)}$$
$$ $$
This scales each feature so that its values **range from 0 to 1** by default. Another range, such as **-1 to 1**, can be selected with the `feature_range` parameter.
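
As a short sketch of the two ranges, the example below scales the training data from earlier in these notes once with the default settings and once with `feature_range=(-1, 1)`. The variable names are chosen for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]], dtype=float)

X_01 = MinMaxScaler().fit_transform(X_train)                         # default feature_range=(0, 1)
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_train)   # rescaled to -1..1

print(X_01)
print(X_pm1)   # equal to 2 * X_01 - 1
```

The second result is a linear stretch of the first, so the relative spacing of the values within each column is unchanged.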

In [46]:
def minmax_scaling(X):
    """
    Replicates sklearn's MinMaxScaler with its default range of 0 to 1.
    Formula: x(scaled) = (x - min(x)) / (max(x) - min(x))
    Returns the scaled data, the column minimums and the column ranges
    """
    maximum = X.max(axis=0)
    minimum = X.min(axis=0)
    # Named value_range so that the built-in range() is not shadowed
    value_range = maximum - minimum
    scaled = (X - minimum) / value_range
    return scaled, minimum, value_range

#### Example 1

In [47]:
# Our function that is same as MinMaxScaler
# x_min and x_range avoid overwriting the built-in min() and range()
X_norm1, x_min, x_range = minmax_scaling(X_train1)
print(X_norm1)

[[1.         1.         0.         1.        ]
 [0.45047923 0.33333333 1.         0.5       ]
 [0.         0.         0.         0.        ]]


In [48]:
# MinMaxScaler in SciKit Learn
minmax_scaler = MinMaxScaler().fit(X_train1)
normalized_arr = minmax_scaler.transform(X_train1)
print(normalized_arr)

[[1.         1.         0.         1.        ]
 [0.45047923 0.33333333 1.         0.5       ]
 [0.         0.         0.         0.        ]]


#### Example 2

In [49]:
x_array2 = np.array([2,3,5,6,7,4,8,7,6]).reshape(-1,1)

In [50]:
# MinMaxScaler in SciKit Learn
minmax_scaler = MinMaxScaler().fit(x_array2)
normalized_arr = minmax_scaler.transform(x_array2)
print(normalized_arr)

[[0.        ]
 [0.16666667]
 [0.5       ]
 [0.66666667]
 [0.83333333]
 [0.33333333]
 [1.        ]
 [0.83333333]
 [0.66666667]]


In [51]:
# Our function that is same as MinMaxScaler
# x_min and x_range avoid overwriting the built-in min() and range()
x_norm2, x_min, x_range = minmax_scaling(x_array2)
print(x_norm2)

[[0.        ]
 [0.16666667]
 [0.5       ]
 [0.66666667]
 [0.83333333]
 [0.33333333]
 [1.        ]
 [0.83333333]
 [0.66666667]]


#### Example 3

In [52]:
X_array3 = np.array([[2000,3,30,0.5], [1000,4,45,0.4], [1500,1,50,0.3]])

In [53]:
# MinMaxScaler in SciKit Learn
minmax_scaler = MinMaxScaler().fit(X_array3)
normalized_arr = minmax_scaler.transform(X_array3)
print(normalized_arr)

[[1.         0.66666667 0.         1.        ]
 [0.         1.         0.75       0.5       ]
 [0.5        0.         1.         0.        ]]


In [54]:
# MinMaxScaler in SciKit Learn
normalized_arr = MinMaxScaler().fit_transform(X_array3)
print(normalized_arr)

[[1.         0.66666667 0.         1.        ]
 [0.         1.         0.75       0.5       ]
 [0.5        0.         1.         0.        ]]


In [55]:
# Our function that is same as MinMaxScaler
# x_min and x_range avoid overwriting the built-in min() and range()
X_norm3, x_min, x_range = minmax_scaling(X_array3)
print(X_norm3)

[[1.         0.66666667 0.         1.        ]
 [0.         1.         0.75       0.5       ]
 [0.5        0.         1.         0.        ]]


<div class="alert alert-block alert-info">

Please note that SciKit Learn scalers use `fit()` to learn the scaling parameters and `transform()` to apply them. Alternatively, we can combine both steps using `fit_transform()`.

</div>
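
Keeping `fit()` and `transform()` separate matters once there is data that was not part of training: the scaler is fitted on the training data only and then reused. The sketch below illustrates this with `StandardScaler`. The house in `X_new` is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[2104, 5, 1, 45], [1416, 3, 2, 40], [852, 2, 1, 35]], dtype=float)
X_new = np.array([[1200, 3, 1, 20]], dtype=float)   # hypothetical unseen example

scaler = StandardScaler().fit(X_train)       # learn mean_ and scale_ from training data
X_train_scaled = scaler.transform(X_train)   # apply them to the training data
X_new_scaled = scaler.transform(X_new)       # reuse the same parameters on new data

# fit_transform() is shorthand for fit() followed by transform() on the same data
X_one_step = StandardScaler().fit_transform(X_train)

print(X_train_scaled)
print(X_new_scaled)
```

Calling `fit_transform()` on the new data instead would compute a fresh mean and standard deviation from that single row, which is not what a trained model expects.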

## Normalization vs Standardization

The feature scaling techniques mentioned above are the most commonly used. So when should we use normalization (MinMaxScaler) and when standardization (StandardScaler)? The main characteristics of both techniques are as follows:

**Normalization (MinMaxScaler)**

- Uses the minimum and maximum of each feature for scaling.
- Scales all data to a range from 0 to 1, or optionally from -1 to 1.
- It is used when the features have different scales that vary significantly. Examples are sensor data and pixel values.
- It is affected by outliers.
- It is useful when we do not know the distribution of the data.
- If the data is not Gaussian distributed, MinMaxScaler is often the preferred choice.
- It works best when we know that the minimum and maximum values are not outliers and carry important information.
- Because it depends on the minimum and maximum values in the dataset, it works best when the data is not skewed and is spread evenly between the two boundaries.
- Distance-based models such as kNN and SVM, as well as neural networks, are sensitive to feature range and benefit from scaling. It is also a common choice for deep learning models.

**Standardization (StandardScaler)**

- Also known as Z-Score Normalization.
- Uses the mean and standard deviation of each feature for scaling.
- Scales all data such that the mean is 0 and the standard deviation is 1.
- If the data follows a Gaussian (normal) distribution, use standardization.
- It does not have range boundaries.
- It is also affected by outliers, but less so than normalization.
- Algorithms such as logistic regression, linear regression, principal component analysis, and SVM with a linear kernel are commonly paired with standardization, and it suits them best when the data is roughly Gaussian.
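
To make the outlier points above concrete, the sketch below scales a small made-up column in which one value is far from the rest. Both scalers are linear, so neither removes the outlier. The example only measures how much of the scaled range is left for the four ordinary values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values: four ordinary points and one outlier (100)
x = np.array([1, 2, 3, 4, 100], dtype=float).reshape(-1, 1)

x_minmax = MinMaxScaler().fit_transform(x)
x_std = StandardScaler().fit_transform(x)

# Spread of the four ordinary points after scaling
inlier_spread_minmax = x_minmax[:4].max() - x_minmax[:4].min()
inlier_spread_std = x_std[:4].max() - x_std[:4].min()

print(x_minmax.ravel())
print(x_std.ravel())
print(inlier_spread_minmax, inlier_spread_std)
```

With MinMax scaling the four ordinary points share about 3% of the fixed 0 to 1 interval, because the outlier defines the maximum. Standardization has no fixed interval, but the outlier still inflates the standard deviation and pulls the ordinary points close together.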



**Additional Reference**

- https://medium.com/@meritshot/standardization-v-s-normalization-6f93225fbd84
- https://towardsdatascience.com/normalization-vs-standardization-explained-209e84d0f81e
- https://www.secoda.co/learn/when-to-normalize-or-standardize-data
- https://www.kdnuggets.com/2020/04/data-transformation-standardization-normalization.html
- https://www.geeksforgeeks.org/normalization-vs-standardization/
- https://www.simplilearn.com/normalization-vs-standardization-article
- https://chatgpt.com/share/66fccc15-ebac-8000-b22a-1cf3847f08ce

## Pandas and Numpy Difference in Scaling

**Why the Difference?**

The short answer is that **Pandas uses a different formula**.

A more detailed explanation is as follows:

When computing the standard deviation, Pandas uses the **sample** formula by default:

$$s = \sqrt{\frac{\sum\limits_{i=1}^{n}(x_i - \overline{x})^2}{n - 1}}$$

where:

- $s$ is the sample standard deviation
- $\overline{x}$ is the sample mean
- $n$ is the total number of observations in the sample

Numpy uses the **population** standard deviation formula by default:

$$\sigma = \sqrt{\frac{\sum\limits_{i=1}^{N}(x_i - \mu)^2}{N}}$$


where:

- $\sigma$ is the population standard deviation
- $\mu$ is the population mean
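- $N$ is the total number of observations in the population

StandardScaler in SciKit Learn uses the population standard deviation, the same as the Numpy default. Our function calls the `std()` method of whatever object it receives, so the formula depends on whether the input is a Numpy array or a Pandas object.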
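
The two defaults can be checked on a small made-up list before turning to the housing data. This is only an illustrative sketch. The values are chosen so that the population standard deviation comes out to exactly 2.

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data: mean 5, sum of squared deviations 32

np_std = np.std(values)             # Numpy default ddof=0: sqrt(32 / 8)
pd_std = pd.Series(values).std()    # Pandas default ddof=1: sqrt(32 / 7)

print(np_std)
print(pd_std)

# Either library reproduces the other's result when ddof is set explicitly
print(np.std(values, ddof=1))
print(pd.Series(values).std(ddof=0))
```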



### Demonstration of Differences in Scaling Between Pandas and Numpy Using Housing Data

#### Pandas Calculation Using Housing Data

In [187]:
# Load housing data
df1 = pd.read_csv('./data/housing_one_var.csv')
Xsp_train = df1['sqft'].to_frame()

In [188]:
X_normSp, avgSp, stdSp = std_norm_v1(Xsp_train)

In [189]:
X_normSp[:5]

|   | sqft |
| --- | --- |
| 0 | 0.13001 |
| 1 | -0.50419 |
| 2 | 0.502476 |
| 3 | -0.735723 |
| 4 | 1.257476 |


In [190]:
avgSp

sqft    2000.680851
dtype: float64

In [191]:
stdSp

sqft    794.702354
dtype: float64

#### Numpy Calculation Using Housing Data

In [192]:
stdsp_scaler = StandardScaler()
normalizedSp_arr = stdsp_scaler.fit_transform(Xsp_train)

In [193]:
normalizedSp_arr[:5]

array([[ 0.13141542],
       [-0.5096407 ],
       [ 0.5079087 ],
       [-0.74367706],
       [ 1.27107075]])

In [194]:
stdsp_scaler.mean_

array([2000.68085106])

In [195]:
stdsp_scaler.scale_

array([786.20261874])

- **Note the difference in standard deviation: 794.70 from our function on Pandas data vs 786.20 from StandardScaler. The normalized data also differs slightly.**
- **The difference arises because our function calls the `std()` method of the Pandas DataFrame we fed into it, which uses the sample formula.**

### SciKit Learn Calculation

In [196]:
type(Xsp_train)

pandas.core.frame.DataFrame

In [197]:
std_scaler1 = StandardScaler().fit(Xsp_train)
normalized_arr1 = std_scaler1.transform(Xsp_train)

In [198]:
std_scaler1.scale_

array([786.20261874])

In [199]:
Xsp_train2 = Xsp_train.to_numpy()
type(Xsp_train2)

numpy.ndarray

In [200]:
std_scaler2 = StandardScaler().fit(Xsp_train2)
normalized_arr2 = std_scaler2.transform(Xsp_train2)

In [201]:
std_scaler2.scale_

array([786.20261874])

**In SciKit Learn, the result is the same regardless of the data type we feed in.**

### Manual Computation of Standard Deviation to Confirm Pandas and Numpy Formula

The following manual computation reproduces both results and shows where the difference comes from.

In [202]:
dataMean = Xsp_train.mean()
dataMean

sqft    2000.680851
dtype: float64

In [203]:
sumOfSquare = (Xsp_train - dataMean) ** 2
sumOfSquare = sumOfSquare.sum()
sumOfSquare

sqft    2.905138e+07
dtype: float64

**Compute Standard Deviation Using Population Formula**

In [204]:
# Dividing the sum of squares by n gives the population variance
pop_variance = sumOfSquare/len(Xsp_train)
pop_variance

sqft    618114.557718
dtype: float64

In [205]:
# Population standard deviation is the square root of the population variance
np.sqrt(pop_variance)

sqft    786.202619
dtype: float64

**Compute Standard Deviation Using Sample Formula**

In [206]:
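# Dividing the sum of squares by (n - 1) gives the sample variance
sample_variance = sumOfSquare/(len(Xsp_train)-1)
sample_variance

sqft    631551.830712
dtype: float64

In [207]:
# Sample standard deviation is the square root of the sample variance
np.sqrt(sample_variance)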

sqft    794.702354
dtype: float64

**The above confirms that Pandas uses the `sample` formula to compute the standard deviation.**
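
Because the two formulas share the same numerator, the sample and population results should differ by a fixed factor. As a quick check by hand, the sum of squares $2.905138 \times 10^{7}$ divided by the population variance $618114.56$ gives $n = 47$ observations, and then:

$$s = \sigma\sqrt{\frac{n}{n-1}} = 786.2026 \times \sqrt{\frac{47}{46}} \approx 786.2026 \times 1.010811 \approx 794.70$$

The factor approaches 1 as $n$ grows, so the gap between the two results shrinks for larger datasets.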

### Pandas and Numpy Computation of Standard Deviation

In [208]:
type(Xsp_train)

pandas.core.frame.DataFrame

In [209]:
Xsp_train.std()

sqft    794.702354
dtype: float64

In [210]:
type(Xsp_train2)

numpy.ndarray

In [211]:
Xsp_train2.std()

786.2026187430467

As shown above, `std()` is a method of the Numpy or Pandas object it is called on, and the two libraries use different defaults. This is where the difference comes from.

### Pandas - Standard Deviation with Delta Degrees of Freedom (ddof)

**We can ask Pandas to use the population formula by setting the delta degrees of freedom (`ddof`) to 0. For more information, please refer to the links below:**
$$$$
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.std.html
- https://towardsdatascience.com/data-normalization-with-pandas-and-scikit-learn-7c1cc6ed6475

In [212]:
# Below is Pandas Standard Deviation
Xsp_train.std()

sqft    794.702354
dtype: float64

In [213]:
# Below is Numpy Standard Deviation
Xsp_train2.std()

786.2026187430467

In [214]:
Xsp_train.std(ddof=0)

sqft    786.202619
dtype: float64

<div class="alert alert-block alert-info">

**The two standard deviations differ only slightly, so the choice matters little for the objective of machine learning. This section just highlights the difference.**

</div>

### New Z-Score Function

**Below is the new z-score function, which also handles Pandas data.**

In [215]:
def std_norm_v2(X):
    """
    Z-score normalizer.
    Formula: x(scaled) = (x - mean(x)) / std(x), using the population standard deviation
    The equivalent scaler in sklearn is StandardScaler
    """
    # Check whether the input is a pandas Series
    # and, if so, convert it to a DataFrame
    if isinstance(X, pd.Series):
        X = X.to_frame()
    
    avg = X.mean(axis=0)

    # pandas defaults to ddof=1 (sample formula); force ddof=0 to match Numpy and StandardScaler
    if isinstance(X, pd.DataFrame):
        std = X.std(axis=0, ddof=0)
    else:
        std = X.std(axis=0)
    
    X_norm = (X-avg)/std
    
    return X_norm, avg, std

In [216]:
# Feeding pandas dataframe into own Scaling Function
x_normDf1, avgDf1, stdDf1 = std_norm_v2(Xsp_train)

In [217]:
x_normDf1.head()

|   | sqft |
| --- | --- |
| 0 | 0.131415 |
| 1 | -0.509641 |
| 2 | 0.507909 |
| 3 | -0.743677 |
| 4 | 1.271071 |


In [218]:
stdDf1

sqft    786.202619
dtype: float64

In [219]:
# Feeding numpy into own Scaling Function
x_norm1, avg1, std1 = std_norm_v2(Xsp_train2)

In [220]:
std1

array([786.20261874])

In [221]:
x_norm1[:5]

array([[ 0.13141542],
       [-0.5096407 ],
       [ 0.5079087 ],
       [-0.74367706],
       [ 1.27107075]])

## Application of Feature Scaling Using Housing Data

Feature scaling is not only applicable to gradient descent or regression problems. It applies to a variety of machine learning methods.

Feature scaling is crucial for machine learning algorithms that are sensitive to the magnitude of the data, such as gradient-based algorithms like regression and neural networks. Distance-based methods such as SVMs and K-Means are also affected when features are not scaled.

Decision trees are among the few machine learning methods that are not affected by feature scale.

The choice of technique depends on the specific machine learning algorithm and the properties of your data.

### Prepare Data

In [222]:
df = pd.read_csv('./data/housing_one_var.csv')
df.head()

|   | sqft | price |
| --- | --- | --- |
| 0 | 2104 | 399900 |
| 1 | 1600 | 329900 |
| 2 | 2400 | 369000 |
| 3 | 1416 | 232000 |
| 4 | 3000 | 539900 |


In [223]:
df2 = pd.read_csv('./data/housing_two_var.txt')
df2.head()

|   | sqft | rm | price |
| --- | --- | --- | --- |
| 0 | 2104 | 3 | 399900 |
| 1 | 1600 | 3 | 329900 |
| 2 | 2400 | 3 | 369000 |
| 3 | 1416 | 2 | 232000 |
| 4 | 3000 | 4 | 539900 |


## Applying Feature Scaling: Z-Score Scaling on Housing Price Data (One Feature)

### Z-Score Scaling (Housing Data - One Feature)

In [224]:
x_train1 = df['sqft']
y_train1 = df['price']

In [225]:
xdf_norm1, avg_df1, stddev_df1 = std_norm_v2(x_train1)

In [226]:
xdf_norm1[:5]

|   | sqft |
| --- | --- |
| 0 | 0.131415 |
| 1 | -0.509641 |
| 2 | 0.507909 |
| 3 | -0.743677 |
| 4 | 1.271071 |


In [227]:
avg_df1

sqft    2000.680851
dtype: float64

In [228]:
stddev_df1

sqft    786.202619
dtype: float64

In [229]:
x_train1 = x_train1.to_numpy().reshape(-1,1)
y_train1 = y_train1.to_numpy().reshape(-1,1)

In [230]:
x_norm1, avg1, stddev1 = std_norm_v2(x_train1)

In [231]:
x_norm1[:5]

array([[ 0.13141542],
       [-0.5096407 ],
       [ 0.5079087 ],
       [-0.74367706],
       [ 1.27107075]])

In [232]:
avg1

array([2000.68085106])

In [233]:
stddev1

array([786.20261874])

In [234]:
std_scaler1 = StandardScaler().fit(x_train1)
normalized_arr1 = std_scaler1.transform(x_train1)

In [235]:
normalized_arr1[:5]

array([[ 0.13141542],
       [-0.5096407 ],
       [ 0.5079087 ],
       [-0.74367706],
       [ 1.27107075]])

In [236]:
std_scaler1.mean_

array([2000.68085106])

In [237]:
std_scaler1.scale_

array([786.20261874])

### Running Gradient Descent

In [238]:
coef1, intercept1, _, _, _ = my.compute_gradient_descent(x_norm1, y_train1)

iteration 9999: Last cost = 2.0581e+09: intercept = 3.4041e+05: weights = [[105764.13349281]]
best w [[105764.1335]]
best b 340412.6596


In [239]:
# SciKit Linear regression with standard scaling
std_scaler1 = StandardScaler().fit(x_train1)
normalized_arr1 = std_scaler1.transform(x_train1)
reg1 = LinearRegression().fit(normalized_arr1, y_train1)
print('w',reg1.coef_)
print('b',reg1.intercept_)

w [[105764.13349282]]
b [340412.65957447]


In [240]:
# SciKit Linear Regression without scaling
lr1 = LinearRegression().fit(x_train1, y_train1)
print('w',lr1.coef_)
print('b',lr1.intercept_)

w [[134.52528772]]
b [71270.49244873]


<div class="alert alert-block alert-info">
Please note that the fitted coefficients differ between scaled and unscaled data.

</div>
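The two sets of coefficients describe the same line. Substituting $z = (x - \mu)/\sigma$ into $f = w_z z + b_z$ gives $w = w_z/\sigma$ and $b = b_z - w_z\mu/\sigma$. The sketch below checks this with the values printed above. The variable names are chosen for illustration only.

```python
# Values printed in the cells above (z-score scaled fit and scaler statistics)
w_scaled = 105764.13349282
b_scaled = 340412.65957447
mean_sqft = 2000.68085106
std_sqft = 786.20261874

# Substitute z = (x - mean) / std into f = w_scaled * z + b_scaled
w_raw = w_scaled / std_sqft
b_raw = b_scaled - w_scaled * mean_sqft / std_sqft

# Compare with the unscaled fit printed above: 134.52528772 and 71270.49244873
print(w_raw, b_raw)
```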

### Same Prediction for Scaled and Unscaled Data

**The following is my prediction using normalized data**

In [241]:
my_predict1 = my.prediction(x_norm1, intercept1, coef1)
my_predict1[:10]

array([[354311.69781211],
       [286510.95280111],
       [394131.18297731],
       [261758.29986059],
       [474846.35560945],
       [338303.18857341],
       [277632.28381158],
       [263238.07802551],
       [256915.38950266],
       [272251.27230277]])

**The following is SciKit Learn prediction using normalized data**

In [242]:
sk_predict1 = reg1.predict(normalized_arr1)
sk_predict1[:10]

array([[354311.69781212],
       [286510.95280112],
       [394131.18297731],
       [261758.29986059],
       [474846.35560945],
       [338303.18857341],
       [277632.28381158],
       [263238.07802551],
       [256915.38950266],
       [272251.27230277]])

**The following is SciKit Learn prediction using unscaled data.**

In [243]:
normal_predict1 = lr1.predict(x_train1)
normal_predict1[:10]

array([[354311.69781212],
       [286510.95280112],
       [394131.18297731],
       [261758.29986059],
       [474846.35560945],
       [338303.18857341],
       [277632.28381158],
       [263238.07802551],
       [256915.38950266],
       [272251.27230277]])

**Although the weights differ between scaled and unscaled data, the predictions should be the same. We can therefore use the predictions to compare the results.**
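One way to make this comparison programmatically is `np.allclose`. The sketch below uses synthetic data (not the housing file) and NumPy's least-squares solver in place of the gradient descent function from these notes, so the names and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(800, 4500, size=50)                # synthetic "sqft" values
y = 135.0 * x + 70000.0 + rng.normal(0, 5000, 50)  # synthetic "price" values

def fit_line(feature, target):
    """Ordinary least squares for target ~ w*feature + b; returns (w, b)."""
    design = np.column_stack([feature, np.ones_like(feature)])
    (w, b), *_ = np.linalg.lstsq(design, target, rcond=None)
    return w, b

z = (x - x.mean()) / x.std()   # z-score scaled feature

w_raw, b_raw = fit_line(x, y)
w_scaled, b_scaled = fit_line(z, y)

pred_raw = w_raw * x + b_raw
pred_scaled = w_scaled * z + b_scaled

print("weights equal:    ", np.isclose(w_raw, w_scaled))
print("predictions equal:", np.allclose(pred_raw, pred_scaled))
```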

### Applying One Feature Scaling Using DataFrame

In [244]:
x_train2 = df['sqft'].to_frame()
y_train2 = df['price'].to_frame()

In [245]:
x_norm2, avg2, stddev2 = std_norm_v2(x_train2)

In [246]:
stddev2

sqft    786.202619
dtype: float64

In [247]:
std_scaler2 = StandardScaler().fit(x_train2.values)
normalized_arr2 = std_scaler2.transform(x_train2.values)

In [248]:
std_scaler2.scale_

array([786.20261874])

In [249]:
x_norm2[:5]

Unnamed: 0,sqft
0,0.131415
1,-0.509641
2,0.507909
3,-0.743677
4,1.271071


In [250]:
normalized_arr2[:5]

array([[ 0.13141542],
       [-0.5096407 ],
       [ 0.5079087 ],
       [-0.74367706],
       [ 1.27107075]])

In [251]:
# GD using own function
coef2, intercept2, _, _, _ = my.compute_gradient_descent(x_norm2, y_train2)

iteration 9999: Last cost = 2.0581e+09: intercept = 3.4041e+05: weights = [[105764.13349281]]
best w [[105764.1335]]
best b 340412.6596


In [252]:
# SciKit Learn Regression using scaled data
reg2 = LinearRegression().fit(normalized_arr2, y_train2)
print('w',reg2.coef_)
print('b',reg2.intercept_)

w [[105764.13349282]]
b [340412.65957447]


In [253]:
# SciKit Linear Regression without scaling
lr2 = LinearRegression().fit(x_train2, y_train2)
print('w',lr2.coef_)
print('b',lr2.intercept_)

w [[134.52528772]]
b [71270.49244873]


### Predicting Housing Price 

#### Predicting Housing Price Using Existing Data

Let us use the first row of the data as the query.

In [254]:
x_train2[:5]

Unnamed: 0,sqft
0,2104
1,1600
2,2400
3,1416
4,3000


In [255]:
y_train2[:5]

Unnamed: 0,price
0,399900
1,329900
2,369000
3,232000
4,539900


In [256]:
coef2, intercept2, _, _, _ = my.compute_gradient_descent(x_norm2, y_train2)

iteration 9999: Last cost = 2.0581e+09: intercept = 3.4041e+05: weights = [[105764.13349281]]
best w [[105764.1335]]
best b 340412.6596


In [257]:
myAsk = 2104
myAskOne = np.array(myAsk).reshape((1,1))

In [258]:
result2 = my.prediction(myAskOne, intercept2, coef2)
result2

array([[2.2286815e+08]])

In [259]:
print('Predicted housing price of {0} sqft is: ${1:,.2f}'.format(myAsk, result2[0][0])) 
# result2 is a 2-D array, so index [0][0] to get a scalar for string formatting

Predicted housing price of 2104 sqft is: $222,868,149.53


The result is far off. The actual price of this house is $399,900:

In [260]:
y_train2[:1]

Unnamed: 0,price
0,399900


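This is because the query was not normalized. The model was trained on scaled `sqft` values, so it treats the raw 2104 as if it were already scaled.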
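The arithmetic behind the inflated figure can be reproduced by hand. The sketch below plugs the printed weight and intercept into $f = wx + b$, first with the raw 2104 and then with its z-score. The variable names are illustrative.

```python
# Weight and intercept learned on z-score scaled sqft (printed above)
w = 105764.13349281
b = 340412.6596

# Scaler statistics computed from the training data
mean_sqft = 2000.680851
std_sqft = 786.202619

raw_query = 2104

# Feeding the raw value treats 2104 as "2104 standard deviations above the mean"
wrong = w * raw_query + b

# Scaling the query first puts it in the units the model was trained on
z = (raw_query - mean_sqft) / std_sqft
right = w * z + b

print(f"unscaled query: {wrong:,.2f}")
print(f"scaled query:   {right:,.2f}")
```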

#### Using Normalized Data for Prediction

To predict with normalized data:

In [268]:
firstData = x_norm2.iloc[0,0]
firstData = firstData.reshape((1,1))
firstData

array([[0.13141542]])

In [269]:
my_predictA = my.prediction(firstData, intercept2, coef2)

In [270]:
my_predictA

array([[354311.69781211]])

Comparing against prediction without scaling 

In [271]:
# Just predict the first 10 data
my_predict2 = lr2.predict(x_train2[:10])

In [272]:
# result from first prediction
my_predict2[:1]

array([[354311.69781212]])

#### Using Data not in Existing Dataset

2104 sqft is the first row of our training dataset. What happens if we want to predict a value that is not in the training set? Let us say we want to predict the price of a 2800 sqft house.

In [273]:
x_train2[:5]

Unnamed: 0,sqft
0,2104
1,1600
2,2400
3,1416
4,3000


In [274]:
y_train2[:5]

Unnamed: 0,price
0,399900
1,329900
2,369000
3,232000
4,539900


Based on the data above, a 2800 sqft house lies between the 2400 sqft and 3000 sqft houses, so we expect the predicted price to fall between 369,000 and 539,900.

To normalize new data, we need the mean and standard deviation that were used to normalize the training data. This is why our function also returns the mean and standard deviation.
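A minimal sketch of this idea follows: one function learns the mean and standard deviation from the training data, and a second reuses them on new values. The names `zscore_fit` and `zscore_apply` are hypothetical stand-ins, not the `std_norm_v2` function used in these notes, and the training set here is only a five-row sample.

```python
import numpy as np

def zscore_fit(x):
    """Return the scaled data plus the column-wise mean and standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)          # population standard deviation (ddof=0)
    return (x - mean) / std, mean, std

def zscore_apply(x_new, mean, std):
    """Scale new data with statistics learned from the training data."""
    return (x_new - mean) / std

# Small illustrative training set (first five sqft values from the data)
x_train = np.array([[2104.0], [1600.0], [2400.0], [1416.0], [3000.0]])
x_scaled, mean, std = zscore_fit(x_train)

# A new query is scaled with the training statistics, never with its own
query = np.array([[2800.0]])
query_scaled = zscore_apply(query, mean, std)
print(query_scaled)
```

Because only five rows are used here, the scaled query differs from the value obtained with the statistics of the full dataset.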

In [275]:
avg2

sqft    2000.680851
dtype: float64

In [276]:
stddev2

sqft    786.202619
dtype: float64

In [277]:
myAskTwo = np.array([2800]).reshape(1,-1)

In [278]:
myAskTwo.shape

(1, 1)

In [279]:
myAskTwoNorm = (myAskTwo[0] - avg2) / stddev2
myAskTwoNorm

sqft    1.016683
dtype: float64

In [280]:
my_predictB = my.prediction(myAskTwoNorm, intercept2, coef2)
my_predictB

array([447941.2980654])

Our prediction of 447,941 falls between 369,000 and 539,900, as expected.

Please note that in SciKit Learn, the `StandardScaler` object already stores the scaling statistics once we run the command

`std_scaler = StandardScaler().fit(x_train)`

In the command above, `std_scaler` is the variable name of the scaler fitted on `x_train`.

In [281]:
reg2.predict(myAskTwo) # still gives a huge figure because the input is not normalized

array([[2.96479986e+08]])

To scale new data, we call `std_scaler.transform()`. The fitted scaler reuses the mean and standard deviation it stored during `fit`.

In [282]:
# The following is the mean previously computed
std_scaler2.mean_

array([2000.68085106])

In [283]:
# The following is the standard deviation previously computed
std_scaler2.scale_

array([786.20261874])

In [284]:
myAskTwoRegNorm = std_scaler2.transform(myAskTwo)

In [285]:
reg2.predict(myAskTwoRegNorm)

array([[447941.2980654]])
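To reduce the chance of forgetting the scaling step, SciKit Learn can chain the scaler and the model with `make_pipeline`, so that `predict` scales raw input automatically. The sketch below uses five sample rows rather than the full file, so its numbers will differ from the outputs above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Five sample rows from the housing data (sqft, price)
x = np.array([[2104.0], [1600.0], [2400.0], [1416.0], [3000.0]])
y = np.array([399900.0, 329900.0, 369000.0, 232000.0, 539900.0])

# The pipeline fits the scaler, then fits the regression on the scaled output
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(x, y)

# Raw queries go straight in; the stored mean and scale are applied internally
query = np.array([[2800.0]])
pipeline_pred = model.predict(query)

# Same fit without scaling, for comparison
plain_pred = LinearRegression().fit(x, y).predict(query)
print(pipeline_pred, plain_pred)
```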

The following is the SciKit Learn model trained on unscaled data. The prediction should be the same.

In [286]:
# SciKit Linear Regression without scaling
lr2 = LinearRegression().fit(x_train2, y_train2)
print('w',lr2.coef_)
print('b',lr2.intercept_)

w [[134.52528772]]
b [71270.49244873]


In [287]:
lr2.predict(myAskTwo)



array([[447941.2980654]])

SciKit Learn raises a warning here because the model was fitted on a Pandas DataFrame but is asked to predict on a NumPy array, which has no feature names. To avoid this warning, fit with `x_train.values` instead.
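Two ways to keep the input types consistent are sketched below, using five sample rows instead of the full file. Whether a warning is emitted for mismatched types depends on the installed SciKit Learn version, so this sketch only shows the two consistent combinations.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

x_df = pd.DataFrame({"sqft": [2104, 1600, 2400, 1416, 3000]})
y = np.array([399900, 329900, 369000, 232000, 539900])

# Option 1: fit on the underlying NumPy array, query with a NumPy array
model_np = LinearRegression().fit(x_df.values, y)
pred_np = model_np.predict(np.array([[2800]]))

# Option 2: fit on the DataFrame, query with a DataFrame that has the same column
model_df = LinearRegression().fit(x_df, y)
pred_df = model_df.predict(pd.DataFrame({"sqft": [2800]}))

print(pred_np, pred_df)
```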

In [288]:
# SciKit Linear Regression without scaling
lr2 = LinearRegression().fit(x_train2.values, y_train2)
print('w',lr2.coef_)
print('b',lr2.intercept_)

w [[134.52528772]]
b [71270.49244873]


In [289]:
lr2.predict(myAskTwo)

array([[447941.2980654]])

## Applying Feature Scaling: Z Score Scaling on Housing Price Data (Multiple Regression)

In [290]:
X2_train3 = df2[['sqft','rm']]
y2_train3 = df2['price']
X2_train3 = X2_train3.to_numpy()
y2_train3 = y2_train3.to_numpy()

In [291]:
# SciKit Learn Regression without Scaling
lr3 = LinearRegression().fit(X2_train3, y2_train3)
print('w',lr3.coef_)
print('b',lr3.intercept_)

w [  139.21067402 -8738.01911233]
b 89597.90954279748


In [292]:
# Own scaling function with own gradient descent function
X_norm3, avg3, stddev3 = std_norm_v2(X2_train3)
coef3, intercept3, _, _, _ = my.compute_gradient_descent(X_norm3, y2_train3)

iteration 9999: Last cost = 2.0433e+09: intercept = 3.4041e+05: weights = [[109447.79646964  -6578.35485416]]
best w [[109447.7965]
 [ -6578.3549]]
best b 340412.6596


In [293]:
# SciKit Learn Regression with Scaling
std_scaler3 = StandardScaler()
normalized_arr3 = std_scaler3.fit_transform(X2_train3)
reg3 = LinearRegression().fit(normalized_arr3, y2_train3)
print('w',reg3.coef_)
print('b',reg3.intercept_)

w [109447.79646964  -6578.35485416]
b 340412.6595744681


In [294]:
normal_predict3 = lr3.predict(X2_train3)
normal_predict3[:10]

array([356283.1103389 , 286120.93063401, 397489.46984812, 269244.1857271 ,
       472277.85514636, 330979.02101847, 276933.02614885, 262037.48402897,
       255494.58235014, 271364.59918815])

In [295]:
sk_predict3 = reg3.predict(normalized_arr3)
sk_predict3[:10]

array([356283.1103389 , 286120.93063401, 397489.46984812, 269244.1857271 ,
       472277.85514636, 330979.02101847, 276933.02614885, 262037.48402897,
       255494.58235014, 271364.59918815])

In [296]:
my_predict3 = my.prediction(X_norm3, intercept3, coef3)
my_predict3[:10]

array([[356283.11033889],
       [286120.93063401],
       [397489.46984811],
       [269244.1857271 ],
       [472277.85514636],
       [330979.02101847],
       [276933.02614885],
       [262037.48402896],
       [255494.58235014],
       [271364.59918814]])

In [297]:
print('avg shape:', avg3.shape)
print('std dev shape:', stddev3.shape)
print('intercept shape:', intercept3.shape)
print('coef shape:', coef3.shape)

avg shape: (2,)
std dev shape: (2,)
intercept shape: ()
coef shape: (2, 1)


In [298]:
avg3 = avg3.reshape((1,-1))
stddev3 = stddev3.reshape(1,-1)

In [299]:
myAskThree = np.array([2800, 3]).reshape((1,2))

In [300]:
myAskThreeNorm = (myAskThree - avg3)/stddev3
myAskThreeNorm

array([[ 1.0166834 , -0.22609337]])

In [301]:
result3 = my.prediction(myAskThreeNorm, intercept3, coef3)
result3

array([[453173.73945516]])

In [302]:
myAskThreeNormReg = std_scaler3.transform(myAskThree)
myAskThreeNormReg

array([[ 1.0166834 , -0.22609337]])

In [303]:
reg3.predict(myAskThreeNormReg)

array([453173.73945517])

In [304]:
lr3.predict(myAskThree)

array([453173.73945517])

Similarly, for multiple-feature regression, we cannot get an accurate prediction from an un-normalized query.

In [305]:
reg3.predict(myAskThree)

array([3.06774508e+08])

## Applying Feature Scaling: MinMax Scaling on Housing Price Data (One Feature)

### MinMax Scaling (Housing Data - One Feature)

In [306]:
x_train1b = df['sqft']
y_train1b = df['price']
x_train1b = x_train1b.to_numpy().reshape(-1,1)
y_train1b = y_train1b.to_numpy().reshape(-1,1)

In [307]:
x_norm1b, min1b, range1b = minmax_scaling(x_train1b)

In [308]:
x_norm1b[:5]

array([[0.34528406],
       [0.20628792],
       [0.42691671],
       [0.1555433 ],
       [0.59238831]])

In [309]:
min1b

array([852])

In [310]:
range1b

array([3626])

In [311]:
minmax_scaler1b = MinMaxScaler().fit(x_train1b)
normalized_arr1b = minmax_scaler1b.transform(x_train1b)

In [312]:
normalized_arr1b[:5]

array([[0.34528406],
       [0.20628792],
       [0.42691671],
       [0.1555433 ],
       [0.59238831]])

In [313]:
minmax_scaler1b.data_max_

array([4478.])

In [314]:
minmax_scaler1b.data_min_

array([852.])

In [315]:
minmax_scaler1b.data_range_

array([3626.])

### Running Gradient Descent

In [316]:
coef1b, intercept1b, _, _, _ = my.compute_gradient_descent(x_norm1b, y_train1b, iterations=200000)

iteration 199999: Last cost = 2.0581e+09: intercept = 1.8589e+05: weights = [[487788.69327352]]
best w [[487788.6933]]
best b 185886.0376


In [317]:
# SciKit Linear regression with MinMax scaling
minmax_scaler1b = MinMaxScaler().fit(x_train1b)
normalized_arr1b = minmax_scaler1b.transform(x_train1b)
reg1b = LinearRegression().fit(normalized_arr1b, y_train1b)
print('w',reg1b.coef_)
print('b',reg1b.intercept_)

w [[487788.6932736]]
b [185886.03758637]


In [318]:
# SciKit Linear Regression without scaling
lr1b = LinearRegression().fit(x_train1b, y_train1b)
print('w',lr1b.coef_)
print('b',lr1b.intercept_)

w [[134.52528772]]
b [71270.49244873]


Please note that the fitted coefficients differ between scaled and unscaled data.
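The min-max coefficients describe the same line as the unscaled ones. With $x' = (x - x_{min})/(x_{max} - x_{min})$, the unscaled weight is $w = w'/\text{range}$ and the unscaled intercept is $b = b' - w' x_{min}/\text{range}$. The sketch below checks this with the values printed above. The variable names are illustrative.

```python
# Values printed in the cells above (min-max scaled fit and scaler statistics)
w_scaled = 487788.6932736
b_scaled = 185886.03758637
min_sqft = 852.0
range_sqft = 3626.0

# Substitute x' = (x - min) / range into f = w_scaled * x' + b_scaled
w_raw = w_scaled / range_sqft
b_raw = b_scaled - w_scaled * min_sqft / range_sqft

# Compare with the unscaled fit printed above: 134.52528772 and 71270.49244873
print(w_raw, b_raw)
```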

### Same Prediction for Scaled and Unscaled Data

**The following is my prediction using normalized data**

In [319]:
my_predict1b = my.prediction(x_norm1b, intercept1b, coef1b)
my_predict1b[:10]

array([[354311.69781212],
       [286510.95280112],
       [394131.1829773 ],
       [261758.2998606 ],
       [474846.35560943],
       [338303.18857341],
       [277632.28381159],
       [263238.07802553],
       [256915.38950268],
       [272251.27230278]])

**The following is SciKit Learn prediction using normalized data**

In [320]:
sk_predict1b = reg1b.predict(normalized_arr1b)
sk_predict1b[:10]

array([[354311.69781212],
       [286510.95280112],
       [394131.18297731],
       [261758.29986059],
       [474846.35560945],
       [338303.18857341],
       [277632.28381158],
       [263238.07802551],
       [256915.38950266],
       [272251.27230277]])

**The following is SciKit Learn prediction using unscaled data.**

In [321]:
normal_predict1b = lr1b.predict(x_train1b)
normal_predict1b[:10]

array([[354311.69781212],
       [286510.95280112],
       [394131.18297731],
       [261758.29986059],
       [474846.35560945],
       [338303.18857341],
       [277632.28381158],
       [263238.07802551],
       [256915.38950266],
       [272251.27230277]])

**Although the weights differ between scaled and unscaled data, the predictions should be the same. We can therefore use the predictions to compare the results.**

### Predicting Housing Price 

#### Predicting Housing Price Using Existing Data

In [322]:
coef1b

array([[487788.69327352]])

In [323]:
intercept1b

185886.03758639883

In [324]:
myAsk = 2104
myAskOne = np.array(myAsk).reshape((1,1))

In [325]:
result1b = my.prediction(myAskOne, intercept1b, coef1b)
result1b

array([[1.0264933e+09]])

In [326]:
print('Predicted housing price of {0} sqft is: ${1:,.2f}'.format(myAsk, result1b[0][0])) 
# result1b is a 2-D array, so index [0][0] to get a scalar for string formatting

Predicted housing price of 2104 sqft is: $1,026,493,296.69


In [327]:
y_train1b[0]

array([399900])

The prediction is far from the actual price of 399,900 because the query was not normalized.

#### Using Normalized Data for Prediction

To predict with normalized data:

In [328]:
firstData = normalized_arr1b[0]
firstData = firstData.reshape((1,1))
firstData

array([[0.34528406]])

In [329]:
my_predictA1 = my.prediction(firstData, intercept1b, coef1b)
my_predictA1

array([[354311.69781212]])

Comparing against prediction without scaling 

In [330]:
# Just predict the first 10 data
my_predictA2 = lr1b.predict(x_train1b[:10])

In [331]:
# result from first prediction
my_predictA2[:1]

array([[354311.69781212]])

#### Using Data not in Existing Dataset

Based on the dataset, a 2800 sqft house lies between the 2400 sqft and 3000 sqft houses, so we expect the predicted price to fall between 369,000 and 539,900.

To normalize new data, we need the minimum and the range (maximum minus minimum) that were used to normalize the training data. This is why our function also returns the minimum and the range.
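A minimal sketch of the same idea for min-max scaling is shown below. The names `minmax_fit` and `minmax_apply` are hypothetical stand-ins, not the `minmax_scaling` function used in these notes, and the three training values were picked only because they include the dataset's minimum (852) and maximum (4478).

```python
import numpy as np

def minmax_fit(x):
    """Return the scaled data plus the column-wise minimum and range."""
    x_min = x.min(axis=0)
    x_range = x.max(axis=0) - x_min
    return (x - x_min) / x_range, x_min, x_range

def minmax_apply(x_new, x_min, x_range):
    """Scale new data with the minimum and range learned from training data."""
    return (x_new - x_min) / x_range

# Illustrative training values spanning the dataset's minimum and maximum sqft
x_train = np.array([[852.0], [2104.0], [4478.0]])
x_scaled, x_min, x_range = minmax_fit(x_train)

inside = minmax_apply(np.array([[2800.0]]), x_min, x_range)
outside = minmax_apply(np.array([[5000.0]]), x_min, x_range)
print(inside, outside)
```

A query larger than the training maximum, such as 5000 sqft here, maps to a value above 1. The formula still applies, but the model is then extrapolating beyond the range it was trained on.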

In [332]:
myAskTwo = np.array([2800]).reshape(1,-1)

In [333]:
myAskTwoNorm = (myAskTwo[0] - min1b) / range1b
myAskTwoNorm

array([0.53723111])

In [334]:
my_predictB = my.prediction(myAskTwoNorm, intercept1b, coef1b)
my_predictB

array([447941.29806539])

Our prediction of 447,941 falls between 369,000 and 539,900, as expected.

In SciKit Learn, the fitted `MinMaxScaler` object stores the minimum and range. To scale new data, we call `minmax_scaler.transform()`.

In [335]:
myAskTwoRegNorm = minmax_scaler1b.transform(myAskTwo)

In [336]:
reg1b.predict(myAskTwoRegNorm)

array([[447941.2980654]])

The following is the SciKit Learn model trained on unscaled data. The prediction should be the same.

In [337]:
lr1b.predict(myAskTwo)

array([[447941.2980654]])

## Applying Feature Scaling: MinMax Scaling on Housing Price Data (Multiple Regression)

In [338]:
# SciKit Learn Regression without Scaling
lr2b = LinearRegression().fit(X2_train3, y2_train3)
print('w',lr2b.coef_)
print('b',lr2b.intercept_)

w [  139.21067402 -8738.01911233]
b 89597.90954279748


In [339]:
# Own scaling function with own gradient descent function
X_norm2b, min2b, range2b = minmax_scaling(X2_train3)
coef2b, intercept2b, _, _, _ = my.compute_gradient_descent(X_norm2b, y2_train3, iterations=200000)

iteration 199999: Last cost = 2.0433e+09: intercept = 1.9947e+05: weights = [[504777.90398781 -34952.07644922]]
best w [[504777.904 ]
 [-34952.0764]]
best b 199467.3847


In [340]:
# SciKit Learn Regression with Scaling
minmax_scaler2b = MinMaxScaler()
normalized_arr2b = minmax_scaler2b.fit_transform(X2_train3)
reg2b = LinearRegression().fit(normalized_arr2b, y2_train3)
print('w',reg2b.coef_)
print('b',reg2b.intercept_)

w [504777.90398791 -34952.07644931]
b 199467.38469348656


#### Predicting Housing Price Using Existing Data

In [341]:
normal_predict2b = lr2b.predict(X2_train3)
normal_predict2b[:10]

array([356283.1103389 , 286120.93063401, 397489.46984812, 269244.1857271 ,
       472277.85514636, 330979.02101847, 276933.02614885, 262037.48402897,
       255494.58235014, 271364.59918815])

In [342]:
sk_predict2b = reg2b.predict(normalized_arr2b)
sk_predict2b[:10]

array([356283.1103389 , 286120.93063401, 397489.46984812, 269244.1857271 ,
       472277.85514636, 330979.02101847, 276933.02614885, 262037.48402897,
       255494.58235014, 271364.59918815])

In [343]:
my_predict2b = my.prediction(X_norm2b, intercept2b, coef2b)
my_predict2b[:10]

array([[356283.11033889],
       [286120.93063402],
       [397489.4698481 ],
       [269244.18572709],
       [472277.85514635],
       [330979.02101849],
       [276933.02614886],
       [262037.48402898],
       [255494.58235015],
       [271364.59918815]])

#### Using Data not in Existing Dataset

In [344]:
min2b = min2b.reshape((1,-1))
range2b = range2b.reshape(1,-1)

In [345]:
myAskThree = np.array([2800, 3]).reshape((1,2))

In [346]:
myAskThreeNorm = (myAskThree - min2b)/range2b
myAskThreeNorm

array([[0.53723111, 0.5       ]])

In [347]:
result2b = my.prediction(myAskThreeNorm, intercept2b, coef2b)
result2b

array([[453173.73945514]])

In [348]:
myAskThreeNormReg = minmax_scaler2b.transform(myAskThree)
myAskThreeNormReg

array([[0.53723111, 0.5       ]])

In [349]:
reg2b.predict(myAskThreeNormReg)

array([453173.73945517])

In [350]:
lr2b.predict(myAskThree)

array([453173.73945517])

Similarly, for multiple-feature regression, we cannot get an accurate prediction from an un-normalized query.

In [351]:
reg2b.predict(myAskThree)

array([1.41347274e+09])

## End Note 10