In [17]:
from sklearn import datasets, linear_model
import timeit

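# return_X_y=True returns the feature matrix and the target vector as two arrays
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
# Keep only the feature at column index 2, as a column vector of shape (n_samples, 1)
raw = diabetes_X[:, None, 2]
max_raw = max(raw)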

In [18]:
print(raw)

[[ 0.06169621]
 [-0.05147406]
 [ 0.04445121]
 [-0.01159501]
 [-0.03638469]
 [-0.04069594]
 [-0.04716281]
 [-0.00189471]
 [ 0.06169621]
 [ 0.03906215]
 [-0.08380842]
 [ 0.01750591]
 [-0.02884001]
 [-0.00189471]
 [-0.02560657]
 [-0.01806189]
 [ 0.04229559]
 [ 0.01211685]
 [-0.0105172 ]
 [-0.01806189]
 [-0.05686312]
 [-0.02237314]
 [-0.00405033]
 [ 0.06061839]
 [ 0.03582872]
 [-0.01267283]
 [-0.07734155]
 [ 0.05954058]
 [-0.02129532]
 [-0.00620595]
 [ 0.04445121]
 [-0.06548562]
 [ 0.12528712]
 [-0.05039625]
 [-0.06332999]
 [-0.03099563]
 [ 0.02289497]
 [ 0.01103904]
 [ 0.07139652]
 [ 0.01427248]
 [-0.00836158]
 [-0.06764124]
 [-0.0105172 ]
 [-0.02345095]
 [ 0.06816308]
 [-0.03530688]
 [-0.01159501]
 [-0.0730303 ]
 [-0.04177375]
 [ 0.01427248]
 [-0.00728377]
 [ 0.0164281 ]
 [-0.00943939]
 [-0.01590626]
 [ 0.0250506 ]
 [-0.04931844]
 [ 0.04121778]
 [-0.06332999]
 [-0.06440781]
 [-0.02560657]
 [-0.00405033]
 [ 0.00457217]
 [-0.00728377]
 [-0.0374625 ]
 [-0.02560657]
 [-0.02452876]
 [-0.01806

In [19]:
print(max_raw)

[0.17055523]


In [20]:
min_raw = min(raw)
# Min-max scale the feature so that min_raw maps to -1 and max_raw maps to 1
scaled = (2*raw - max_raw - min_raw)/(max_raw - min_raw)

def train_raw():
    linear_model.LinearRegression().fit(raw, diabetes_y)

def train_scaled():
    linear_model.LinearRegression().fit(scaled, diabetes_y)

    
   

In [21]:
scaled_time = timeit.timeit(train_scaled, number=1000)

In [22]:
raw_time = timeit.timeit(train_raw, number=1000)

In [23]:
print(raw_time)

1.2926446999999825


In [24]:
print(scaled_time)

1.3418534000002182


Another important reason for scaling is that some machine learning
algorithms and techniques are very sensitive to the relative magnitudes of
the different features. For example, a k-means clustering algorithm that
uses the Euclidean distance as its proximity measure will end up relying
heavily on features with larger magnitudes. Lack of scaling also affects the
efficacy of L1 or L2 regularization since the magnitude of weights for a
feature depends on the magnitude of values of that feature, and so
different features will be affected differently by regularization. By scaling
all features to lie in the range [–1, 1], we ensure that there is not much of a
difference in the relative magnitudes of different features.
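
To make the effect on distance-based methods concrete, here is a minimal sketch in plain Python. The points, the feature names (age and income), and the feature ranges are all hypothetical values chosen for illustration; they do not come from the diabetes dataset used above.

```python
import math

# Hypothetical two-feature points: (age in years, income in dollars).
a = (25.0, 50_000.0)
b = (65.0, 51_000.0)   # very different age, similar income
c = (26.0, 60_000.0)   # similar age, noticeably different income

# Assumed feature ranges, chosen only for this illustration.
AGE_RANGE = (18.0, 90.0)
INCOME_RANGE = (0.0, 200_000.0)

def min_max_scale(value, lo, hi):
    """Linearly map the interval [lo, hi] onto [-1, 1]."""
    return (2 * value - hi - lo) / (hi - lo)

def scale_point(p):
    """Scale each feature of a point with its own assumed range."""
    return (min_max_scale(p[0], *AGE_RANGE),
            min_max_scale(p[1], *INCOME_RANGE))

# On raw values, the income axis dominates the Euclidean distance.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# After scaling, both features contribute on a comparable footing.
scaled_ab = math.dist(scale_point(a), scale_point(b))
scaled_ac = math.dist(scale_point(a), scale_point(c))

print(raw_ab < raw_ac)        # b looks closer to a on raw values
print(scaled_ab > scaled_ac)  # c is closer to a once features are scaled
```

On the raw values, the 40-year age gap between a and b is swamped by income differences measured in thousands of dollars, so the nearest neighbor of a flips once both features are brought into the same range.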

Min-max scaling
The numeric value is linearly scaled so that the minimum value that
the input can take is scaled to –1 and the maximum possible value to 1:
x1_scaled = (2*x1 - max_x1 - min_x1)/(max_x1 - min_x1)
The problem with min-max scaling is that the maximum and minimum
values (max_x1 and min_x1) have to be estimated from the training
dataset, and they are often outlier values. As a result, the bulk of the
real data often gets shrunk to a very narrow range within the [–1, 1] band.
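
The shrinking effect can be seen in a small sketch. The helper name min_max_scale and the sample values are illustrative assumptions; the formula is the one given above.

```python
def min_max_scale(values):
    """Scale values to [-1, 1] using the min and max of the values themselves."""
    lo, hi = min(values), max(values)
    return [(2 * x - hi - lo) / (hi - lo) for x in values]

# Hypothetical feature values, with and without a single outlier.
typical = [10.0, 12.0, 11.0, 13.0, 9.0]
with_outlier = typical + [1000.0]

scaled_typical = min_max_scale(typical)
scaled_outlier = min_max_scale(with_outlier)

print(scaled_typical)       # spread across the full [-1, 1] band
print(scaled_outlier[:5])   # all five typical values squeezed near -1
print(scaled_outlier[-1])   # the outlier alone sits at 1
```

Without the outlier, the five values spread over the whole band. With it, the same five values all land below –0.99, so the scaled feature carries almost no usable variation.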

Clipping (in conjunction with min-max scaling)

Helps address the problem of outliers by using “reasonable” values
instead of estimating the minimum and maximum from the training
dataset. The numeric value is linearly scaled between these two
reasonable bounds, then clipped to lie in the range [–1, 1]. This has the
effect of treating outliers as –1 or 1.
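
A minimal sketch of clipping combined with min-max scaling follows. The function name scale_and_clip and the bounds 0 and 20 are hypothetical; in practice the "reasonable" bounds would come from domain knowledge.

```python
def scale_and_clip(x, lo, hi):
    """Min-max scale x using fixed bounds lo and hi, then clip to [-1, 1]."""
    scaled = (2 * x - hi - lo) / (hi - lo)
    return max(-1.0, min(1.0, scaled))

# Assumed "reasonable" bounds for a hypothetical feature.
LO, HI = 0.0, 20.0

values = [10.0, 15.0, 1000.0, -5.0]
clipped = [scale_and_clip(x, LO, HI) for x in values]
print(clipped)
```

Values inside the bounds keep their relative spacing, while 1000.0 and -5.0 fall outside the bounds and are mapped to 1 and –1 respectively.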

Z-score normalization

Addresses the problem of outliers without requiring prior knowledge
of what the reasonable range is by linearly scaling the input using the
mean and standard deviation estimated over the training dataset:
x1_scaled = (x1 - mean_x1)/stddev_x1
The name of the method reflects the fact that the scaled value has zero
mean and is normalized by the standard deviation so that it has unit
variance over the training dataset. The scaled value is unbounded, but
does lie within [–1, 1] the majority of the time (about 68%, if the
underlying distribution is normal). Values outside this range get rarer
the larger their absolute value gets, but are still present.
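
As a rough sketch of z-score normalization, the snippet below uses only the Python standard library. The helper names z_score_fit and z_score_transform and the training values are illustrative assumptions. The statistics are estimated once on the training data and then reused, so that any later data is scaled the same way.

```python
import statistics
from statistics import NormalDist

def z_score_fit(train_values):
    """Estimate the mean and (population) standard deviation from training data."""
    mean = statistics.fmean(train_values)
    stddev = statistics.pstdev(train_values)
    return mean, stddev

def z_score_transform(values, mean, stddev):
    """Apply x_scaled = (x - mean) / stddev with previously fitted statistics."""
    return [(x - mean) / stddev for x in values]

# Hypothetical training values for a single feature.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean_x1, stddev_x1 = z_score_fit(train)
scaled = z_score_transform(train, mean_x1, stddev_x1)
print(mean_x1, stddev_x1)
print(scaled)

# Fraction of a standard normal distribution that lies within [-1, 1].
std_normal = NormalDist()
within_one_sigma = std_normal.cdf(1.0) - std_normal.cdf(-1.0)
print(round(within_one_sigma, 4))
```

The scaled training values have zero mean and unit variance, and the value 9.0 maps outside [–1, 1], which illustrates that the result is unbounded.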
