* Normalization

* Standardization

* Log Transformation

* Robust Scaler

* Max Absolute Scaler

In [13]:
import pandas as pd
import numpy as np
import requests
from sklearn.preprocessing import StandardScaler,MaxAbsScaler,MinMaxScaler
from sklearn.preprocessing import FunctionTransformer,RobustScaler


In [2]:
url = 'https://raw.githubusercontent.com/jarif87/DataSets/main/supershops.csv'

# pandas can read the CSV directly from the URL, so no separate HTTP request is needed
data = pd.read_csv(url)

data.head()

Unnamed: 0,Marketing Spend,Administration,Transport,Area,Profit
0,114523.61,136897.8,471784.1,Dhaka,192261.83
1,162597.7,151377.59,443898.53,Ctg,191792.06
2,153441.51,101145.55,407934.54,Rangpur,191050.39
3,144372.41,118671.85,383199.62,Dhaka,182901.99
4,142107.34,91391.77,366168.42,Rangpur,166187.94


In [3]:
df0=data.copy()
df1=data.copy()
df2=data.copy()
df3=data.copy()
df4=data.copy()

# Normalization

In [4]:


# Select only the numeric columns
numeric_cols = df0.select_dtypes(include='number')

# Initialize the MinMaxScaler (default feature_range=(0, 1))
scaler = MinMaxScaler()

# Fit and transform all numeric columns in one call; each column is scaled
# independently, so the result matches scaling them one by one in a loop
df0[numeric_cols.columns] = scaler.fit_transform(numeric_cols)


In [5]:
numeric_cols.head()

Unnamed: 0,Marketing Spend,Administration,Transport,Profit
0,114523.61,136897.8,471784.1,192261.83
1,162597.7,151377.59,443898.53,191792.06
2,153441.51,101145.55,407934.54,191050.39
3,144372.41,118671.85,383199.62,182901.99
4,142107.34,91391.77,366168.42,166187.94


In [6]:
df0.head()

Unnamed: 0,Marketing Spend,Administration,Transport,Area,Profit
0,0.692617,0.651744,1.0,Dhaka,1.0
1,0.983359,0.761972,0.940893,Ctg,0.997355
2,0.927985,0.379579,0.864664,Rangpur,0.993178
3,0.873136,0.512998,0.812235,Dhaka,0.947292
4,0.859438,0.305328,0.776136,Rangpur,0.853171


In [7]:
df0["Marketing Spend"].max()

1.0

In [8]:
df0["Marketing Spend"].min()

0.0
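The scaling above maps each column onto [0, 1] using x' = (x - min) / (max - min). As a minimal sketch, the formula can be checked by hand against `MinMaxScaler`; the toy array `x` below is hypothetical and not taken from the supershops data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column (assumption, not from the dataset)
x = np.array([[10.0], [20.0], [25.0], [50.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Same computation with scikit-learn (default feature_range=(0, 1))
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value in the column maps to 0 and the largest to 1, which is what the `min()` and `max()` checks on `df0` confirm.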

# Standardization

* **Standardize features by removing the mean and scaling to unit variance.**

* **The standard score of a sample x is calculated as:**

* **z = (x - u) / s**

* **Here u is the mean of the training samples and s is their standard deviation.**
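The formula can be verified by hand. The sketch below uses a hypothetical toy array `x` (an assumption for illustration) and compares the manual z-score with `StandardScaler`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to `ddof=1`, so results computed with pandas defaults can differ slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column (assumption, not from the dataset)
x = np.array([[2.0], [4.0], [4.0], [6.0]])

u = x.mean(axis=0)   # per-column mean
s = x.std(axis=0)    # per-column standard deviation with ddof=0

# Manual standard score: z = (x - u) / s
manual = (x - u) / s

# Same computation with scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```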

In [11]:
numeric_col = df1.select_dtypes(include="number").columns
stand_scal = StandardScaler()
# Fit and transform all numeric columns in one call; each column is standardized independently
df1[numeric_col] = stand_scal.fit_transform(df1[numeric_col])

In [12]:
df1.head()

Unnamed: 0,Marketing Spend,Administration,Transport,Area,Profit
0,0.897913,0.560753,2.165287,Dhaka,2.011203
1,1.95586,1.082807,1.929843,Ctg,1.99943
2,1.754364,-0.728257,1.626191,Rangpur,1.980842
3,1.554784,-0.096365,1.417348,Dhaka,1.776627
4,1.504937,-1.079919,1.27355,Rangpur,1.35774


# Log Transformation

# FunctionTransformer

### Constructs a transformer from an arbitrary callable.

* **A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.**

* **Note: If a lambda is used as the function, then the resulting transformer will not be pickleable.**
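As a small illustration of the points above, the sketch below wraps `np.log1p` in a `FunctionTransformer` and pairs it with `np.expm1` through the `inverse_func` parameter so the transformation can be undone. Using named NumPy functions rather than lambdas also keeps the transformer pickleable. The toy array `x` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy column (assumption, not from the dataset)
x = np.array([[0.0], [9.0], [99.0]])

# log1p(x) = log(1 + x); expm1 is its exact inverse
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

logged = log_tf.fit_transform(x)
restored = log_tf.inverse_transform(logged)

print(logged.ravel())
print(np.allclose(restored, x))
```

`np.log1p` is a reasonable choice when a column may contain zeros, since log(0) is undefined while log1p(0) is 0.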

In [14]:
func_col = df2.select_dtypes(include="number").columns
# np.log1p computes log(1 + x), which stays defined when a value is 0
func_transformer = FunctionTransformer(np.log1p)
# Apply the transformation to all numeric columns in one call
df2[func_col] = func_transformer.fit_transform(df2[func_col])

In [15]:
df2.head()

Unnamed: 0,Marketing Spend,Administration,Transport,Area,Profit
0,11.648545,11.826997,13.064279,Dhaka,12.166619
1,11.99904,11.927539,13.003354,Ctg,12.164172
2,11.941081,11.524326,12.918864,Rangpur,12.160298
3,11.880158,11.684126,12.856314,Dhaka,12.116711
4,11.864345,11.422922,12.810851,Rangpur,12.020881


# Robust Scaler

# Scale features using statistics that are robust to outliers.

* **This scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).**

* **Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.**

* **Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.**
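The description above amounts to (x - median) / IQR per feature. The sketch below checks this by hand against `RobustScaler` with its default settings. The toy array `x`, including its deliberate outlier of 100, is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy column with one large outlier (assumption, not from the dataset)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)  # 1st and 3rd quartiles

# Manual robust scaling: (x - median) / (q3 - q1)
manual = (x - median) / (q3 - q1)

# Same computation with scikit-learn (default quantile_range=(25.0, 75.0))
scaled = RobustScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the median and IQR barely move when the outlier changes, the first four values keep a sensible spread, while the outlier simply lands far from zero.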

In [18]:
rb_col = df3.select_dtypes(include="number").columns
transformer = RobustScaler()
# Fit and transform all numeric columns in one call; each column uses its own median and IQR
df3[rb_col] = transformer.fit_transform(df3[rb_col])

In [19]:
df3.head()

Unnamed: 0,Marketing Spend,Administration,Transport,Area,Profit
0,0.67253,0.345355,1.552016,Dhaka,1.69834
1,1.452113,0.697565,1.383714,Ctg,1.688874
2,1.303634,-0.52429,1.166654,Rangpur,1.673929
3,1.156567,-0.097977,1.017368,Dhaka,1.509736
4,1.119836,-0.761543,0.914576,Rangpur,1.172943


# Max Absolute Scaler

# Scale each feature by its maximum absolute value.

* **This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.**

* **This scaler can also be applied to sparse CSR or CSC matrices.**
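A minimal sketch of both points, assuming a hypothetical toy matrix `x`: each column is divided by its largest absolute value, and because nothing is shifted, zeros stay zeros, so a sparse CSR input keeps the same number of stored entries.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy matrix with zeros and a negative value (assumption, not from the dataset)
x = np.array([[-4.0, 0.0],
              [ 2.0, 0.0],
              [ 0.0, 5.0]])

# Manual scaling: divide each column by its maximum absolute value
manual = x / np.abs(x).max(axis=0)

# Same computation with scikit-learn on a dense array
scaled = MaxAbsScaler().fit_transform(x)

# The scaler also accepts a sparse CSR matrix and returns a sparse result
x_sparse = sparse.csr_matrix(x)
scaled_sparse = MaxAbsScaler().fit_transform(x_sparse)

print(manual)
print(np.allclose(manual, scaled))
print(scaled_sparse.nnz == x_sparse.nnz)
```

On data with no negative values and a column minimum above zero, such as the supershops columns, the result differs from min-max scaling because the minimum is not mapped to 0.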

In [20]:
mx_absolute = df4.select_dtypes(include="number").columns
mx_absvalue = MaxAbsScaler()
# Fit and transform all numeric columns in one call; each column is divided by its own max absolute value
df4[mx_absolute] = mx_absvalue.fit_transform(df4[mx_absolute])

In [21]:
df4.head()

Unnamed: 0,Marketing Spend,Administration,Transport,Area,Profit
0,0.692617,0.749527,1.0,Dhaka,1.0
1,0.983359,0.828805,0.940893,Ctg,0.997557
2,0.927985,0.553781,0.864664,Rangpur,0.993699
3,0.873136,0.649738,0.812235,Dhaka,0.951317
4,0.859438,0.500378,0.776136,Rangpur,0.864383
