I’ve been reading about scaling inputs before fitting a model (something I already do directly in some of the regularization functions, which expose it as a parameter). Here is what I concluded:
•	In standard (unregularized) linear regression it is not necessary to scale the input data.
•	For regularization, clustering and PCA it is important to scale the data before fitting.
•	We must scale the data after the train-test split, to avoid leaking information from the test set into training (which often leads to overly optimistic results during evaluation).
•	We scale the test data using the statistics learned from the training data.
•	We can scale both input and output variables, but normally we scale only the input variables.
•	There are many methods to scale the data; the best known are normalization (min-max scaling) and standardization. Standardization does not strictly require normally distributed variables, but it tends to work best when each variable is roughly Gaussian. We can try both and evaluate the results.
•	If we expect some outliers, RobustScaler usually works better.
•	Use sklearn's Pipeline when the data needs more than one preprocessing step before fitting.
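The points about splitting first, reusing the training statistics on the test data, and using a pipeline can be sketched as follows. This is a minimal illustration, not a recommended setup: the synthetic data, the choice of `Ridge` with `alpha=1.0`, and the step names `"scale"` and `"ridge"` are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up for illustration): two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50, 40, size=200), rng.normal(0.05, 0.04, size=200)])
y = 0.1 * X[:, 0] + 30.0 * X[:, 1] + rng.normal(0, 1, size=200)

# 1) Split first, so the test rows play no part in the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 2) Manual route: fit the scaler on the training data only, then reuse it on the test data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses the training mean and variance

# 3) Pipeline route: the same steps bundled, so fit() only ever sees training data.
model = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])
model.fit(X_train, y_train)
r2_test = model.score(X_test, y_test)

print("Training means used for scaling:", scaler.mean_)
print("Test R^2:", round(r2_test, 3))
```

The pipeline route is usually the safer of the two, because the scaler is refit automatically inside each fold if the model is later passed to cross-validation.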
These are the online sources I used:
•	When and Why to Standardize Your Data | Built In (https://builtin.com/data-science/when-and-why-standardize-your-data)
•	How to use Data Scaling Improve Deep Learning Model Stability and Performance | MachineLearningMastery.com (https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/)
•	When, Why, And How You Should Standardize Your Data | machinelearningcompass.com (https://machinelearningcompass.com/dataset_optimization/standardization/)
•	Standardization in case of real-time predictions | element61 (https://www.element61.be/en/resource/standardization-case-real-time-predictions)
•	Standardization and prediction on new data | Cross Validated (https://stats.stackexchange.com/questions/214331/standardization-and-prediction-on-new-data)
•	How and why to Standardize your data: A python tutorial | Towards Data Science (https://towardsdatascience.com/how-and-why-to-standardize-your-data-996926c2c832)


# Data scaling
In this notebook we will explore how to scale the data, a step that is essential for machine learning methods that are sensitive to feature scale, such as PCA, clustering, K-nearest neighbors and support vector machines.
Differences in the scales across input variables may increase the difficulty of the problem being modeled. An example of this is that large input values (e.g. a spread of hundreds or thousands of units) can result in a model that learns large weight values. A model with large weight values is often unstable, meaning that it may suffer from poor performance during learning and sensitivity to input values resulting in higher generalization error. (https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/)
 
I will use the scaling methods of Scikit-Learn and check each result against a manual calculation.
Note: Most of these methods accept a list of lists (each inner list being one row) or a DataFrame, and apply the scaling column by column.

## Data Normalization

Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.

Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data.

A value is normalized as follows:
$$
x_{scaled} = (x - x_{min})/(x_{max} - x_{min})
$$
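As a quick worked check of the formula, the sketch below applies it by hand with NumPy to the first column of this notebook's example data. It assumes the column is not constant; if `x_max` equals `x_min` the division is undefined.

```python
import numpy as np

# First column of the notebook's example data.
x = np.array([100.0, 8.0, 50.0, 88.0, 4.0])

x_min, x_max = x.min(), x.max()            # 4.0 and 100.0
x_scaled = (x - x_min) / (x_max - x_min)   # apply the formula element-wise

print(x_scaled)
```

For instance, the value 8 becomes (8 - 4) / (100 - 4) = 4 / 96, which is about 0.0417.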

In [6]:
# example of normalization with MinMaxScaler
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
# define data: 5 rows, 2 columns on very different scales
data = asarray([[100, 0.001],
 [8, 0.05],
 [50, 0.005],
 [88, 0.07],
 [4, 0.1]])
print(data)
# define min max scaler (default feature_range is (0, 1))
scaler = MinMaxScaler()
# learn the per-column min and max, then transform the data
scaled = scaler.fit_transform(data)
print(scaled)
print('Minimums',scaler.data_min_)
print('Maximums',scaler.data_max_)

[[1.0e+02 1.0e-03]
 [8.0e+00 5.0e-02]
 [5.0e+01 5.0e-03]
 [8.8e+01 7.0e-02]
 [4.0e+00 1.0e-01]]
[[1.         0.        ]
 [0.04166667 0.49494949]
 [0.47916667 0.04040404]
 [0.875      0.6969697 ]
 [0.         1.        ]]
Minimums [4.e+00 1.e-03]
Maximums [100.    0.1]


In [2]:
import pandas as pd
# same data as a DataFrame with named columns
DataTable2 = pd.DataFrame(data,columns=['a','b'])
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
scaled2 = scaler.fit_transform(DataTable2)
scaledtable = pd.DataFrame(scaled2,columns=['a','b'])
print(DataTable2)
print(scaled2)
print(scaledtable)
print('Minimums',scaler.data_min_)
print('Maximums',scaler.data_max_)

       a      b
0  100.0  0.001
1    8.0  0.050
2   50.0  0.005
3   88.0  0.070
4    4.0  0.100
[[1.         0.        ]
 [0.04166667 0.49494949]
 [0.47916667 0.04040404]
 [0.875      0.6969697 ]
 [0.         1.        ]]
          a         b
0  1.000000  0.000000
1  0.041667  0.494949
2  0.479167  0.040404
3  0.875000  0.696970
4  0.000000  1.000000
Minimums [4.e+00 1.e-03]
Maximums [100.    0.1]


In [3]:
for i in DataTable2.columns:
    print(i)

a
b


In [4]:
# create scaler with a different target minimum and maximum
scaler = MinMaxScaler(feature_range=(-1,1))
scaler.fit(data)
# apply transform
normalized = scaler.transform(data)
print(normalized)
# inverse transform
inverse = scaler.inverse_transform(normalized)
print(inverse)

[[ 1.         -1.        ]
 [-0.91666667 -0.01010101]
 [-0.04166667 -0.91919192]
 [ 0.75        0.39393939]
 [-1.          1.        ]]
[[1.0e+02 1.0e-03]
 [8.0e+00 5.0e-02]
 [5.0e+01 5.0e-03]
 [8.8e+01 7.0e-02]
 [4.0e+00 1.0e-01]]
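The `feature_range=(-1, 1)` result can be reproduced by hand: scale to [0, 1] first, then stretch and shift to the target range. The sketch below does this with NumPy for the first column only; the names `a` and `b` for the ends of the target range are chosen here for illustration.

```python
import numpy as np

x = np.array([100.0, 8.0, 50.0, 88.0, 4.0])   # first column of the example data
a, b = -1.0, 1.0                               # target range, as in feature_range=(-1, 1)

x_std = (x - x.min()) / (x.max() - x.min())    # first map to [0, 1]
x_scaled = x_std * (b - a) + a                 # then stretch and shift to [a, b]

# Inverting the two steps recovers the original values.
x_back = (x_scaled - a) / (b - a) * (x.max() - x.min()) + x.min()

print(x_scaled)
print(x_back)
```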


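## Data Standardization

Standardization rescales each column to zero mean and unit variance: each value has the column mean subtracted and is then divided by the column standard deviation.

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# learn the per-column mean and variance, then transform the data
scaler.fit(data)
standardized = scaler.transform(data)
print(standardized)
print('Mean',scaler.mean_)
print('Variance',scaler.var_)

[[ 1.26398112 -1.16389967]
 [-1.06174414  0.12639634]
 [ 0.         -1.05856939]
 [ 0.96062565  0.65304778]
 [-1.16286263  1.44302493]]
Mean [5.00e+01 4.52e-02]
Variance [1.56480e+03 1.44216e-03]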


In [6]:
import pandas as pd
df = pd.DataFrame(data)
print(df)
# column-wise mean and population standard deviation (ddof=0, as StandardScaler uses)
data_mean = df.mean()
data_std = df.std(ddof=0)
df_standardized = (df-data_mean)/data_std
print(df_standardized)
print('Mean',data_mean)
print('Variance',data_std**2)

       0      1
0  100.0  0.001
1    8.0  0.050
2   50.0  0.005
3   88.0  0.070
4    4.0  0.100
          0         1
0  1.263981 -1.163900
1 -1.061744  0.126396
2  0.000000 -1.058569
3  0.960626  0.653048
4 -1.162863  1.443025
Mean 0    50.0000
1     0.0452
dtype: float64
Variance 0    1564.800000
1       0.001442
dtype: float64
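One detail worth checking when reproducing StandardScaler by hand is which standard deviation is used. StandardScaler divides by the population standard deviation (`ddof=0`), whereas pandas' own `std()` defaults to the sample version (`ddof=1`). The sketch below uses NumPy on the first column only to show how the two choices differ.

```python
import numpy as np

x = np.array([100.0, 8.0, 50.0, 88.0, 4.0])   # first column of the example data

mean = x.mean()
std_pop = x.std(ddof=0)       # population standard deviation (divide by n)
std_sample = x.std(ddof=1)    # sample standard deviation (divide by n - 1)

z_pop = (x - mean) / std_pop        # agrees with the StandardScaler output
z_sample = (x - mean) / std_sample  # slightly smaller in magnitude

print(std_pop ** 2, std_sample ** 2)
print(z_pop)
print(z_sample)
```

With only five rows the difference is visible in the first decimal place; with large datasets it becomes negligible.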


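## Robust Scaling

Robust scaling centers each column on its median and divides by its interquartile range (IQR), so a few outliers have little influence on the result.

In [7]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
# learn the per-column median and IQR, then transform the data
scaler.fit(data)
robust_scaled = scaler.transform(data)
print(robust_scaled)
print('Scale (IQR)',scaler.scale_)
print('Median',scaler.center_)

[[ 0.625      -0.75384615]
 [-0.525       0.        ]
 [ 0.         -0.69230769]
 [ 0.475       0.30769231]
 [-0.575       0.76923077]]
Scale (IQR) [8.0e+01 6.5e-02]
Median [50.    0.05]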


In [9]:
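# manual check: quartiles (cuartiles) of each column
cuartiles = df.quantile([.25, .50, .75]).values
cuartiles = pd.DataFrame(cuartiles)
# rows 0, 1, 2 hold Q1, the median (Mediana) and Q3
Mediana = cuartiles.iloc[1]
Q1 = cuartiles.iloc[0]
Q3 = cuartiles.iloc[2]
# RIC = interquartile range (IQR)
RIC = Q3-Q1
print(RIC)
# subtract the median and divide by the IQR, column by column
robust_scaled = (df-Mediana)/RIC
print(robust_scaled)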

0    80.000
1     0.065
dtype: float64
       0         1
0  0.625 -0.753846
1 -0.525  0.000000
2  0.000 -0.692308
3  0.475  0.307692
4 -0.575  0.769231
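To see why the median and IQR help when outliers are expected, the sketch below compares standardization and robust scaling on a small hypothetical column containing one extreme value. The data are made up for illustration and the calculation is done by hand with NumPy rather than with the sklearn scalers.

```python
import numpy as np

# Hypothetical column: four ordinary values plus one extreme outlier.
x = np.array([10.0, 12.0, 14.0, 16.0, 1000.0])

# Standardization: the outlier inflates both the mean and the standard deviation.
z = (x - x.mean()) / x.std()

# Robust scaling: center on the median, divide by the interquartile range.
q1, median, q3 = np.percentile(x, [25, 50, 75])
r = (x - median) / (q3 - q1)

print(z[:4])   # the ordinary values end up almost indistinguishable
print(r[:4])   # the ordinary values keep a clear spread
```

The outlier itself is still extreme after robust scaling; what changes is that the remaining values are not squeezed together by it.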
