**NORMALIZATION**


*  Normalization in data science is a technique used to adjust the values of numeric data to a common scale without distorting differences in the ranges of values.

*   Normalization helps in bringing all features to the same scale, improving the performance and training stability of machine learning algorithms. It is particularly useful when features have different units or ranges.






**Data Normalization**


*  Data normalization is a pre-processing step that maps each feature from its current range onto a new, standardized range. Putting features on comparable scales often helps forecasting and prediction models become more accurate.




In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [5]:
df=pd.read_csv('/content/Social_Network_Ads (1).csv')

In [None]:
df

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19.0,19000.0,0
1,15810944,Male,35.0,20000.0,0
2,15668575,Female,26.0,43000.0,0
3,15603246,Female,27.0,57000.0,0
4,15804002,Male,19.0,76000.0,0
...,...,...,...,...,...
395,15691863,Female,46.0,41000.0,1
396,15706071,Male,51.0,23000.0,1
397,15654296,Female,50.0,20000.0,1
398,15755018,Male,36.0,33000.0,0


In [6]:
df1=pd.read_csv('/content/Social_Media_Ads.csv')

In [None]:
df1

Unnamed: 0,Age,EstimatedSalary,Purchased
0,19,19000,0
1,35,20000,0
2,26,43000,0
3,27,57000,0
4,19,76000,0
...,...,...,...
395,46,41000,1
396,51,23000,1
397,50,20000,1
398,36,33000,0


In [29]:
from sklearn.preprocessing import Normalizer

In [None]:
scaler = Normalizer()

In [None]:
scaled_data = scaler.fit_transform(df1)

In [None]:
scaled_df1= pd.DataFrame(scaled_data, columns=df1.columns)

In [None]:
print(scaled_df1.head())

        Age  EstimatedSalary  Purchased
0  0.001000         1.000000        0.0
1  0.001750         0.999998        0.0
2  0.000605         1.000000        0.0
3  0.000474         1.000000        0.0
4  0.000250         1.000000        0.0
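A note on reading the output above: scikit-learn's `Normalizer` rescales each row (sample) to unit length rather than rescaling each column (feature), which is why `EstimatedSalary` comes out close to 1 in every row and `Age` close to 0. The sketch below reproduces this by hand on two illustrative rows shaped like `(Age, EstimatedSalary, Purchased)`; the array `X` and the name `manual` are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two illustrative rows: (Age, EstimatedSalary, Purchased)
X = np.array([[19.0, 19000.0, 0.0],
              [35.0, 20000.0, 0.0]])

# Normalizer (default norm="l2") works row by row
row_normalized = Normalizer(norm="l2").fit_transform(X)

# Same computation by hand: divide each row by its own Euclidean length
manual = X / np.linalg.norm(X, axis=1, keepdims=True)

print(row_normalized)
print(np.linalg.norm(row_normalized, axis=1))  # length of each row after scaling
```

If the goal is to bring each feature onto a common scale, the column-wise scalers covered in the following sections (min-max, z-score, robust) are usually the better fit.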


*Using Normalizer on a dataset with string values (categorical columns must be encoded as numbers first)*

In [24]:
df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19.0,19000.0,0
1,15810944,Male,35.0,20000.0,0
2,15668575,Female,26.0,43000.0,0
3,15603246,Female,27.0,57000.0,0
4,15804002,Male,19.0,76000.0,0


In [25]:
categorical_cols = ['Gender']

In [26]:
df_encoded = pd.get_dummies(df, columns=categorical_cols)

In [27]:
df_encoded = df_encoded.drop(columns=['User ID'])

In [30]:
scaler = Normalizer()

In [31]:
scaled_data = scaler.fit_transform(df_encoded)

In [32]:
scaled_df = pd.DataFrame(scaled_data, columns=df_encoded.columns)


In [33]:
scaled_df.head()

Unnamed: 0,Age,EstimatedSalary,Purchased,Gender_Female,Gender_Male
0,0.001,0.999999,0.0,0.0,5.3e-05
1,0.00175,0.999998,0.0,0.0,5e-05
2,0.000605,1.0,0.0,2.3e-05,0.0
3,0.000474,1.0,0.0,1.8e-05,0.0
4,0.00025,1.0,0.0,0.0,1.3e-05


**Normalization Techniques**


*   Z-score Standardization
*   Min-Max Normalization
*   Decimal Scaling
*   Log Scaling
*   Robust Scaling



**Min-Max Scaling**


*   Min-max scaling shifts and rescales each feature so that its values fall in a fixed range, usually 0 to 1 (or -1 to 1). The smallest value of the feature maps to the lower bound and the largest value to the upper bound.
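As a minimal sketch of the formula behind this, `x' = (x - min) / (max - min)`, the example below computes it by hand and compares it with `MinMaxScaler`. The ages in `x` are illustrative values, not the dataset itself, and `feature_range=(-1, 1)` is shown only to illustrate the alternative target range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative single feature (one column)
x = np.array([[18.0], [30.0], [39.0], [60.0]])

# x' = (x - min) / (max - min) maps the column onto [0, 1]
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# MinMaxScaler applies the same formula per column
to_unit = MinMaxScaler().fit_transform(x)

# feature_range maps the same values onto another interval, here [-1, 1]
to_symmetric = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(manual.ravel())
print(to_symmetric.ravel())
```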





In [None]:
df1

Unnamed: 0,Age,EstimatedSalary,Purchased
0,19,19000,0
1,35,20000,0
2,26,43000,0
3,27,57000,0
4,19,76000,0
...,...,...,...
395,46,41000,1
396,51,23000,1
397,50,20000,1
398,36,33000,0


In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler=MinMaxScaler()
print(scaler.fit(df1))

MinMaxScaler()


In [None]:
x=df1[['Age','EstimatedSalary']]

In [None]:
x_scaled=scaler.fit_transform(x)

In [None]:
x_scaled

array([[0.02380952, 0.02962963],
       [0.4047619 , 0.03703704],
       [0.19047619, 0.20740741],
       [0.21428571, 0.31111111],
       [0.02380952, 0.45185185],
       [0.21428571, 0.31851852],
       [0.21428571, 0.51111111],
       [0.33333333, 1.        ],
       [0.16666667, 0.13333333],
       [0.4047619 , 0.37037037],
       [0.19047619, 0.48148148],
       [0.19047619, 0.27407407],
       [0.04761905, 0.52592593],
       [0.33333333, 0.02222222],
       [0.        , 0.4962963 ],
       [0.26190476, 0.48148148],
       [0.69047619, 0.07407407],
       [0.64285714, 0.08148148],
       [0.66666667, 0.0962963 ],
       [0.71428571, 0.1037037 ],
       [0.64285714, 0.05185185],
       [0.69047619, 0.25185185],
       [0.71428571, 0.19259259],
       [0.64285714, 0.05185185],
       [0.66666667, 0.05925926],
       [0.69047619, 0.03703704],
       [0.73809524, 0.0962963 ],
       [0.69047619, 0.11111111],
       [0.26190476, 0.20740741],
       [0.30952381, 0.02222222],
       ...

**Z-score Normalization**



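*  Z-score normalization (standardization) subtracts the feature's mean from each value and divides the result by the feature's standard deviation. If a value is exactly equal to the mean of all the values of the feature, it is normalized to 0. If it is below the mean, the result is negative, and if it is above the mean, the result is positive.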
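The rule can be sketched as `z = (x - mean) / std`. The example below applies it by hand to a few illustrative values whose mean is 30, then compares the result with `StandardScaler`. The array `x` is an assumption made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single feature with mean 30
x = np.array([[19.0], [25.0], [30.0], [35.0], [41.0]])

mean = x.mean(axis=0)
std = x.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# z = (x - mean) / std
manual = (x - mean) / std

scaled = StandardScaler().fit_transform(x)

print(manual.ravel())  # negative below the mean, 0 at the mean, positive above
```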




In [2]:
from sklearn.preprocessing import StandardScaler

In [7]:
scaler=StandardScaler()

In [8]:
print(scaler.fit_transform(df1))

[[-1.78179743 -1.49004624 -0.74593581]
 [-0.25358736 -1.46068138 -0.74593581]
 [-1.11320552 -0.78528968 -0.74593581]
 ...
 [ 1.17910958 -1.46068138  1.34059793]
 [-0.15807423 -1.07893824 -0.74593581]
 [ 1.08359645 -0.99084367  1.34059793]]


**Decimal Scaling**



*  Decimal scaling is a normalization technique in which the values of a feature are divided by a power of 10. The power is chosen from the maximum absolute value of the feature so that every scaled value has an absolute value below 1.
*  Consider a dataset with the values 45, 678, and 90011. The maximum absolute value is 90011, which has 5 digits, so each value is divided by 10^5 (i.e., 100,000):
    *  45 becomes 0.00045
    *  678 becomes 0.00678
    *  90011 becomes 0.90011





In [18]:
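def decimal_scaling(df):
    # Find the maximum absolute value in the entire DataFrame
    # (one shared scaling factor for all columns, not one per feature)
    max_val = df.abs().max().max()

    # Smallest j with max_val / 10**j < 1; floor(...) + 1 also handles the
    # case where max_val is an exact power of 10 (ceil would leave it at 1.0)
    j = int(np.floor(np.log10(max_val))) + 1 if max_val > 0 else 0

    # Scale the DataFrame
    scaled_data = df / (10**j)

    return scaled_data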

In [22]:
df1 = pd.DataFrame(df1)

# Perform decimal scaling
scaled_df1 = decimal_scaling(df1)
print(df1)
print(scaled_df1)

     Age  EstimatedSalary  Purchased
0     19            19000          0
1     35            20000          0
2     26            43000          0
3     27            57000          0
4     19            76000          0
..   ...              ...        ...
395   46            41000          1
396   51            23000          1
397   50            20000          1
398   36            33000          0
399   49            36000          1

[400 rows x 3 columns]
          Age  EstimatedSalary  Purchased
0    0.000019            0.019   0.000000
1    0.000035            0.020   0.000000
2    0.000026            0.043   0.000000
3    0.000027            0.057   0.000000
4    0.000019            0.076   0.000000
..        ...              ...        ...
395  0.000046            0.041   0.000001
396  0.000051            0.023   0.000001
397  0.000050            0.020   0.000001
398  0.000036            0.033   0.000000
399  0.000049            0.036   0.000001

[400 rows x 3 columns]
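In the output above, a single power of 10 taken from the largest value in the whole DataFrame is applied to every column, which is why `Age` ends up far smaller than `EstimatedSalary`. A per-feature variant, closer to the definition given earlier, might look like the sketch below. The function name `decimal_scaling_per_column` and the small `demo` frame are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

def decimal_scaling_per_column(df):
    """Divide each column by its own power of 10 so that max |value| < 1."""
    max_abs = df.abs().max()
    # floor(log10(m)) + 1 is the smallest integer j with m / 10**j < 1;
    # zeros are swapped for 1.0 first so log10 stays defined
    j = np.floor(np.log10(max_abs.where(max_abs > 0, 1.0))) + 1
    # All-zero columns are left untouched (j = 0)
    j = j.where(max_abs > 0, 0)
    # Series / DataFrame division aligns the exponents with the column names
    return df / (10.0 ** j)

# Illustrative values only
demo = pd.DataFrame({"Age": [19, 35, 60],
                     "EstimatedSalary": [19000, 20000, 150000]})
out = decimal_scaling_per_column(demo)
print(out)
```

With this variant `Age` is divided by 10^2 and `EstimatedSalary` by 10^6, so both columns use most of the range below 1.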


**Log Scaling**



*   Log scaling replaces each value with its logarithm, which compresses a wide range of values into a narrow one. It is most useful for right-skewed features, where the transformed distribution is usually less skewed and sometimes close to normal. The logarithm is only defined for positive values. To perform the scaling, take the log of every value in a column and use the result in place of the original column: x' = log(x). For example, with base 10, log10(24) ≈ 1.38.



In [34]:
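from sklearn.preprocessing import FunctionTransformer
# FunctionTransformer() without a func argument is the identity transform,
# so the output below is the unchanged DataFrame; pass a function such as
# func=np.log10 (on numeric, positive columns only) to actually log-scale.
transformer=FunctionTransformer()
print(transformer.fit_transform(df))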

      User ID  Gender   Age  EstimatedSalary  Purchased
0    15624510    Male  19.0          19000.0          0
1    15810944    Male  35.0          20000.0          0
2    15668575  Female  26.0          43000.0          0
3    15603246  Female  27.0          57000.0          0
4    15804002    Male  19.0          76000.0          0
..        ...     ...   ...              ...        ...
395  15691863  Female  46.0          41000.0          1
396  15706071    Male  51.0          23000.0          1
397  15654296  Female  50.0          20000.0          1
398  15755018    Male  36.0          33000.0          0
399  15594041  Female  49.0          36000.0          1

[400 rows x 5 columns]
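Since `FunctionTransformer()` with no `func` returns its input unchanged, a log transform needs the function passed explicitly. The sketch below does this with `np.log10` on a small illustrative frame of positive numeric columns. The `demo` frame is an assumption made for the example, and for columns that contain zeros (such as `Purchased`) `np.log1p` would be one possible substitute.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Illustrative numeric, strictly positive columns only
demo = pd.DataFrame({"Age": [19.0, 35.0],
                     "EstimatedSalary": [19000.0, 76000.0]})

# Pass the function explicitly; np.log10 is applied element-wise
log_transformer = FunctionTransformer(func=np.log10)
logged = log_transformer.fit_transform(demo)

print(logged)
print(round(float(np.log10(24)), 2))  # the worked example from the text
```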


**Robust Scaling**



*  This scaler subtracts the median and scales the data by a quantile range (by default the IQR, the interquartile range). The IQR is the distance between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and the IQR are barely affected by extreme values, the scaling is robust to outliers.



In [36]:
from sklearn.preprocessing import RobustScaler
scaler=RobustScaler()
print(scaler.fit_transform(df1))

[[-1.10769231 -1.13333333  0.        ]
 [-0.12307692 -1.11111111  0.        ]
 [-0.67692308 -0.6         0.        ]
 ...
 [ 0.8        -1.11111111  1.        ]
 [-0.06153846 -0.82222222  0.        ]
 [ 0.73846154 -0.75555556  1.        ]]
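As a rough sketch of what this computes, `x' = (x - median) / (Q3 - Q1)`, the example below applies the formula by hand to a feature with one outlier and compares it with `RobustScaler` using its default settings. The values in `x` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Illustrative single feature; 500 is a deliberate outlier
x = np.array([[19.0], [26.0], [27.0], [35.0], [500.0]])

median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)

# x' = (x - median) / IQR
manual = (x - median) / (q3 - q1)

scaled = RobustScaler().fit_transform(x)

print(manual.ravel())  # the median maps to 0; the outlier stays far from the rest
```

The outlier does not shift the median or the IQR much, so the other values keep a sensible spread after scaling.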
