
### Feature Scaling

Many machine learning models treat each sample as a point in an n-dimensional feature space. For example, if two variables (x, y) are plotted in a 2D coordinate system and one (y) takes very large values while the other (x) takes very small ones, the Euclidean distance between points is dominated by the larger variable and the smaller one is effectively ignored. Valuable information is lost as a result. Feature scaling addresses this problem.
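
To make the effect concrete, here is a minimal sketch with two made-up samples given as (age, salary). The values are assumptions chosen only for illustration; they show how the feature with the larger range accounts for almost all of the squared Euclidean distance.

```python
import numpy as np

# Two hypothetical samples as (age, salary); the values are illustrative only.
a = np.array([27.0, 48000.0])
b = np.array([50.0, 48500.0])

# Squared contribution of each feature to the Euclidean distance.
squared_diffs = (a - b) ** 2
distance = np.sqrt(squared_diffs.sum())

# Fraction of the squared distance that comes from the salary feature alone.
salary_share = squared_diffs[1] / squared_diffs.sum()

print(f"distance = {distance:.2f}")
print(f"share of squared distance due to salary = {salary_share:.4f}")
```

Although the two ages differ by 23 years, the salary difference of 500 determines nearly the whole distance.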

Additional reasons for transforming data:
1. To more closely approximate a theoretical distribution with nice statistical properties.
2. To spread out the data more evenly.
3. To make the data distribution more symmetric.
4. To make relationships between variables more linear.
5. To make the variance of the data more constant (homoscedasticity).

#### The most commonly used ways to scale features:

1. **Min-Max Scaling**:
   Scales the input so that its minimum is 0 and its maximum is 1, mapping the data to the range [0, 1]. This is useful when the features need to be on the same positive scale. However, it is sensitive to outliers: a single extreme value compresses the remaining values into a narrow part of the range.
   $$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

2. **Standardization**:
   Scales the input to have a mean of 0 and a variance of 1, where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.
   $$X_{stand} = \frac{X - \mu}{\sigma}$$

3. **Normalizing**:
   Scales each sample (row) so that its norm is 1. For instance, for 3D data, every sample will lie on the unit sphere.

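4. **Log Transformation**:
   Takes the logarithm of the data, optionally in combination with one of the above transformations. The logarithm is only defined for positive values, so the data must be strictly positive at the point where the log is applied.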
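
The formulas in the list above can be sketched directly in NumPy. This is a minimal illustration, not a replacement for the scikit-learn scalers; the arrays `x` and `sample` are made-up values, and the standardization uses the population standard deviation (NumPy's default).

```python
import numpy as np

# Hypothetical single feature; the values are illustrative only.
x = np.array([27.0, 30.0, 35.0, 38.0, 50.0])

# Min-max scaling: (x - min) / (max - min), mapping values into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population standard deviation.
x_standard = (x - x.mean()) / x.std()

# Log transformation: only defined for strictly positive values.
x_log = np.log(x)

# Normalizing acts on a sample (row), not on a feature (column):
# divide the vector by its L2 norm so that its length becomes 1.
sample = np.array([3.0, 4.0, 12.0])
sample_unit = sample / np.linalg.norm(sample)

print(x_minmax)
print(x_standard)
print(x_log)
print(sample_unit, np.linalg.norm(sample_unit))
```

Note that the first three transformations are applied to a feature across all samples, while normalizing is applied to one sample across all of its features.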

Scaling inputs to unit norms is a common operation for text classification or clustering. For instance, the dot product of two L2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community. 
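
The relationship between L2 normalization and cosine similarity can be checked numerically. The sketch below uses two made-up vectors standing in for TF-IDF rows (the values are assumptions, not output of a real vectorizer) and compares the cosine similarity computed from its definition with the dot product of the unit-norm vectors.

```python
import numpy as np

# Hypothetical TF-IDF-like vectors; the values are illustrative only.
u = np.array([0.0, 1.2, 0.5, 3.1])
v = np.array([0.7, 0.9, 0.0, 2.4])

# Cosine similarity computed directly from its definition.
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Dot product after scaling each vector to unit L2 norm.
u_unit = u / np.linalg.norm(u)
v_unit = v / np.linalg.norm(v)
dot_of_units = u_unit @ v_unit

# The two quantities agree up to floating-point rounding.
print(np.isclose(cosine, dot_of_units))
```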

For most applications, standardization is recommended. Min-Max Scaling is recommended for neural networks, and normalizing is recommended when clustering, e.g., with k-means.
    

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler


In [6]:
# Read the CSV file 'Data.csv' into a DataFrame and drop any rows with missing values
df = pd.read_csv('Data.csv').dropna()
print(df)
# Extract the 'Age' and 'Salary' columns, convert them to a NumPy array of type float64
x = df[["Age", "Salary"]].values.astype(np.float64)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [9]:
standard_scaler = StandardScaler()
normalizer = Normalizer()
min_max_scaler = MinMaxScaler()

print(f"Standardization:\n {standard_scaler.fit_transform(x)}")
print(f"Normalizing: \n {normalizer.fit_transform(x)}")
print(f"MinMax Scaling: \n {min_max_scaler.fit_transform(x)}")

Standardization:
 [[ 0.69985807  0.58989097]
 [-1.51364653 -1.50749915]
 [-1.12302807 -0.98315162]
 [-0.08137885 -0.37141284]
 [-0.47199731 -0.6335866 ]
 [ 1.22068269  1.20162976]
 [ 1.48109499  1.55119478]
 [-0.211585    0.1529347 ]]
Normalizing: 
 [[6.11110997e-04 9.99999813e-01]
 [5.62499911e-04 9.99999842e-01]
 [5.55555470e-04 9.99999846e-01]
 [6.22950699e-04 9.99999806e-01]
 [6.03448166e-04 9.99999818e-01]
 [6.07594825e-04 9.99999815e-01]
 [6.02409529e-04 9.99999819e-01]
 [5.52238722e-04 9.99999848e-01]]
MinMax Scaling: 
 [[0.73913043 0.68571429]
 [0.         0.        ]
 [0.13043478 0.17142857]
 [0.47826087 0.37142857]
 [0.34782609 0.28571429]
 [0.91304348 0.88571429]
 [1.         1.        ]
 [0.43478261 0.54285714]]
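
As a cross-check, the three results can be reproduced with plain NumPy. This is a sketch under two assumptions: the array below is typed in by hand from the Age and Salary columns of the printed DataFrame, and the scalers run with their default settings (population standard deviation for `StandardScaler`, L2 norm for `Normalizer`).

```python
import numpy as np

# Age and Salary columns copied by hand from the printed DataFrame.
x = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [35.0, 58000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

# Standardization works per column (axis=0) with the population standard deviation.
standardized = (x - x.mean(axis=0)) / x.std(axis=0)

# Normalizing works per row (axis=1): each sample is divided by its own L2 norm.
normalized = x / np.linalg.norm(x, axis=1, keepdims=True)

# Min-max scaling works per column (axis=0).
min_max = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(standardized[0], normalized[0], min_max[0])
```

The per-row behaviour explains why every normalized row is close to (0.0006, 1.0): salary is roughly a thousand times larger than age, so it dominates the norm of each sample. Normalizing is therefore a poor fit for features on very different scales unless they are rescaled first.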
