# Normalization

Normalization is a scaling technique often applied as part of data preparation for machine learning.

The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information.

Normalization rescales each numeric input variable independently to the range [0.00, 1.00], a range in which floating-point numbers are densely spaced and therefore represented with high precision.

In other words, the values of each feature are shifted and rescaled so that the feature's minimum maps to 0.00 and its maximum maps to 1.00.

This technique is also known as Min-Max scaling.

Normalization is a good choice when you know that your data does not follow a Gaussian distribution.


In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-ml-normalization").getOrCreate()

# Formula for normalization
$$ X_i' = \frac{X_i - X_{min}}{X_{max} - X_{min}} $$
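
The formula can be sketched in plain Python before turning to Spark. This is a minimal illustration, not part of the notebook's pipeline: the function name `min_max_scale` and the sample values are hypothetical, and mapping a constant column to 0.5 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1] using min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so every value maps to 0.5
        return [0.5 for _ in values]
    # Shift by the minimum, then divide by the range (max - min)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample values, chosen so the result is easy to check by hand
scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0.0 and the largest to 1.0, and every other value keeps its relative position within the range.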

In [2]:
df = spark.createDataFrame([ (100, 77560, 45),
                             (200, 41560, 23),
                             (300, 30285, 20),
                             (400, 10345, 6),
                             (500, 88000, 50)
                           ], ["user_id", "revenue","num_of_days"])

In [3]:
print("Before Scaling :")
df.show(5)

Before Scaling :
+-------+-------+-----------+
|user_id|revenue|num_of_days|
+-------+-------+-----------+
|    100|  77560|         45|
|    200|  41560|         23|
|    300|  30285|         20|
|    400|  10345|          6|
|    500|  88000|         50|
+-------+-------+-----------+



In [4]:
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

In [5]:
# UDF that extracts the single element of a vector column as a double, rounded to 3 decimal places
unlist = udf(lambda x: round(float(list(x)[0]),3), DoubleType())

# Iterating over columns to be scaled
for i in ["revenue","num_of_days"]:
    # VectorAssembler Transformation - Converting column to vector type
    assembler = VectorAssembler(inputCols=[i],outputCol=i+"_Vect")

    # MinMaxScaler Transformation
    scaler = MinMaxScaler(inputCol=i+"_Vect", outputCol=i+"_Scaled")

    # Pipeline of VectorAssembler and MinMaxScaler
    pipeline = Pipeline(stages=[assembler, scaler])

    # Fitting the pipeline on the dataframe, unwrapping the scaled vector, and dropping the helper column
    df = (pipeline.fit(df).transform(df)
                  .withColumn(i+"_Scaled", unlist(i+"_Scaled"))
                  .drop(i+"_Vect"))

print("After Scaling :")
df.show(5)

After Scaling :
+-------+-------+-----------+--------------+------------------+
|user_id|revenue|num_of_days|revenue_Scaled|num_of_days_Scaled|
+-------+-------+-----------+--------------+------------------+
|    100|  77560|         45|         0.866|             0.886|
|    200|  41560|         23|         0.402|             0.386|
|    300|  30285|         20|         0.257|             0.318|
|    400|  10345|          6|           0.0|               0.0|
|    500|  88000|         50|           1.0|               1.0|
+-------+-------+-----------+--------------+------------------+
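
As a quick check, applying the min-max formula by hand to the first row (user_id 100) reproduces the scaled values in the table. The column minimum and maximum are read directly from the input data: 10345 and 88000 for revenue, 6 and 50 for num_of_days.

For revenue:

$$ \frac{77560 - 10345}{88000 - 10345} = \frac{67215}{77655} \approx 0.866 $$

For num_of_days:

$$ \frac{45 - 6}{50 - 6} = \frac{39}{44} \approx 0.886 $$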

