# Feature Scaling


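Feature scaling is a method for normalizing the range of a dataset's features (its independent variables).
It is commonly used in machine learning and data preprocessing so that features with large numeric ranges do not dominate those with small ones, which can speed up gradient-based training and improve the accuracy of distance-based models.

Common techniques:

1. Standardization
2. Normalization
3. Robust scaling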
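As a rough illustration of how the three techniques differ, the sketch below applies each one to a small made-up array with NumPy. The array values and the function names (`standardize`, `min_max`, `robust_scale`) are assumptions for illustration, and robust scaling is shown in its common median/IQR form.

```python
import numpy as np

def standardize(x):
    """Z-score: subtract the mean, divide by the sample standard deviation."""
    x = np.asarray(x, dtype=float)
    # ddof=1 matches the default of pandas' Series.std()
    return (x - x.mean()) / x.std(ddof=1)

def min_max(x):
    """Min-max normalization: map the smallest value to 0 and the largest to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def robust_scale(x):
    """Robust scaling: center on the median, divide by the interquartile range."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # toy data with one outlier

z = standardize(values)
mm = min_max(values)
rb = robust_scale(values)

print("standardized:", z)
print("min-max:     ", mm)
print("robust:      ", rb)
```

With min-max scaling the outlier squeezes the other four values close to zero, while robust scaling keeps them evenly spaced around the median.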



# When to use feature scaling?

1. When preparing data for machine learning models that are sensitive to feature magnitude
2. When using algorithms that don't assume a particular distribution of the data
3. When using neural networks that expect inputs on a 0–1 scale
4. When performing clustering analyses, which typically rely on distances between points
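To see why distance-based methods such as clustering benefit from scaling, the sketch below compares the Euclidean distance between two customers before and after min-max scaling. The two customers are hypothetical; the age and salary ranges are the minimum and maximum values this notebook reports for the dataset.

```python
import math

# Two hypothetical customers as (Age, EstimatedSalary)
a = (20, 50000.0)
b = (60, 51000.0)

# Raw distance: the salary gap (1000) swamps the age gap (40)
raw_distance = math.dist(a, b)

# Ranges taken from the notebook's min/max outputs
AGE_MIN, AGE_MAX = 18, 92
SAL_MIN, SAL_MAX = 11.58, 199992.48

def scale(row):
    """Min-max scale one (age, salary) pair to the 0-1 range."""
    age, salary = row
    return ((age - AGE_MIN) / (AGE_MAX - AGE_MIN),
            (salary - SAL_MIN) / (SAL_MAX - SAL_MIN))

sa, sb = scale(a), scale(b)
scaled_age_gap = abs(sa[0] - sb[0])
scaled_salary_gap = abs(sa[1] - sb[1])

print("raw distance:", raw_distance)
print("scaled age gap:", scaled_age_gap, "scaled salary gap:", scaled_salary_gap)
```

On the raw values the 40-year age difference barely changes the distance; after scaling it becomes the larger of the two contributions.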

# Import dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('Churn_Modelling.csv')
df_cpy = df.copy()
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [3]:
df.isnull().sum()

Unnamed: 0,0
RowNumber,0
CustomerId,0
Surname,0
CreditScore,0
Geography,0
Gender,0
Age,0
Tenure,0
Balance,0
NumOfProducts,0


In [4]:
# Numeric columns to scale in the examples below
ls = ['Age', 'EstimatedSalary']
ls

['Age', 'EstimatedSalary']
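If the column list should come from the data rather than being typed by hand, pandas can select numeric columns by dtype. The sketch below uses a small hypothetical frame (`toy`) in place of the churn data; `numeric_cols` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the churn DataFrame (three rows, mixed dtypes)
toy = pd.DataFrame({
    "Surname": ["Hargrave", "Hill", "Onio"],
    "Age": [42, 41, 42],
    "EstimatedSalary": [101348.88, 112542.58, 113931.57],
})

# select_dtypes returns a DataFrame; .columns.tolist() extracts the names
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Age', 'EstimatedSalary']
```

On the full dataset this would also pick up identifier and flag columns such as `RowNumber` or `HasCrCard`, which is one reason to list the columns to scale explicitly.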

# 1. Standardization


Also known as z-score normalization, this technique rescales each feature to have a mean of zero and a standard deviation of one: z = (x - mean) / std.
It is most useful when the data is approximately Gaussian.
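A minimal sketch of the z-score formula, using only the five `Age` values shown by `df.head()` rather than the full dataset, so the numbers differ from those computed later on all rows. One detail worth knowing: pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), while NumPy's `std()` defaults to the population version (`ddof=0`). The `ddof` argument of the illustrative `z_score` helper makes that choice explicit.

```python
import numpy as np

def z_score(x, ddof=1):
    """Return (x - mean) / std; ddof=1 mirrors pandas, ddof=0 mirrors NumPy."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=ddof)

ages = np.array([42, 41, 42, 39, 43])  # first five Age values only

z_sample = z_score(ages)              # sample standard deviation
z_population = z_score(ages, ddof=0)  # population standard deviation

print(z_sample)
print(z_population)
```

For a dataset with thousands of rows the two conventions give almost identical results, but the difference is visible on small samples like this one.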



In [5]:
std_df = df_cpy[ls].copy()  # copy so new columns can be added without a SettingWithCopyWarning
std_df.head()

Unnamed: 0,Age,EstimatedSalary
0,42,101348.88
1,41,112542.58
2,42,113931.57
3,39,93826.63
4,43,79084.1


In [6]:
mean = std_df['Age'].mean()
std = std_df['Age'].std()

mean, std

(38.9218, 10.487806451704591)

In [7]:
std_df['Std_Age'] = (std_df['Age'] - mean)/std
std_df.head()


Unnamed: 0,Age,EstimatedSalary,Std_Age
0,42,101348.88,0.293503
1,41,112542.58,0.198154
2,42,113931.57,0.293503
3,39,93826.63,0.007456
4,43,79084.1,0.388852


In [8]:

std_df['Std_Age'] = (std_df['Age'] - std_df['Age'].mean())/std_df['Age'].std()
# Divide by the column's standard deviation, not by the column itself
std_df['Std_EstimatedSalary'] = (std_df['EstimatedSalary'] - std_df['EstimatedSalary'].mean())/std_df['EstimatedSalary'].std()
std_df.head()

Unnamed: 0,Age,EstimatedSalary,Std_Age,Std_EstimatedSalary
0,42,101348.88,0.293503,0.021885
1,41,112542.58,0.198154,0.216523
2,42,113931.57,0.293503,0.240675
3,39,93826.63,0.007456,-0.108912
4,43,79084.1,0.388852,-0.365258
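The per-column steps above can also be written once for every column in the list. This is a sketch on a five-row stand-in frame built from the `df.head()` values, so the resulting z-scores differ from those computed on the full dataset; `toy`, `cols`, and `result` are illustrative names.

```python
import pandas as pd

# Five-row stand-in for df_cpy[ls], using the values shown by df.head()
toy = pd.DataFrame({
    "Age": [42, 41, 42, 39, 43],
    "EstimatedSalary": [101348.88, 112542.58, 113931.57, 93826.63, 79084.10],
})
cols = ["Age", "EstimatedSalary"]

# DataFrame - Series aligns on column names, so each column uses its own mean/std
scaled = ((toy[cols] - toy[cols].mean()) / toy[cols].std()).add_prefix("Std_")

# Build a new frame instead of assigning into a slice of the original
result = toy.join(scaled)
print(result)
```

Each `Std_` column of `result` has a mean of zero and a sample standard deviation of one, which is a quick way to confirm the scaling was applied correctly.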


# 2. Normalization


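Also known as min-max scaling, this technique transforms each feature to the range between zero and one: x' = (x - min) / (max - min).
It is useful when the data doesn't follow a Gaussian distribution, but because the minimum and maximum define the range, it is sensitive to outliers.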
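A minimal sketch of min-max scaling as a reusable function, applied to the five `Age` values shown by `df.head()`. The function name `min_max_scale`, its `feature_range` parameter, and the guard against constant features are assumptions added for illustration.

```python
import numpy as np

def min_max_scale(x, feature_range=(0.0, 1.0)):
    """Linearly map x so its minimum and maximum land on feature_range."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # A constant feature would cause a division by zero
        raise ValueError("constant feature: max equals min, cannot scale")
    lo, hi = feature_range
    unit = (x - x_min) / (x_max - x_min)  # values in [0, 1]
    return unit * (hi - lo) + lo

ages = [42, 41, 42, 39, 43]  # first five Age values only

unit_scaled = min_max_scale(ages)                               # 0 to 1
symmetric = min_max_scale(ages, feature_range=(-1.0, 1.0))      # -1 to 1

print(unit_scaled)
print(symmetric)
```

Here the minimum (39) maps to 0 and the maximum (43) maps to 1. On the full dataset the minimum and maximum are different, so the same ages receive different scaled values.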



In [9]:
nrm_df = df_cpy[ls].copy()  # copy so new columns can be added without a SettingWithCopyWarning
nrm_df.head()

Unnamed: 0,Age,EstimatedSalary
0,42,101348.88
1,41,112542.58
2,42,113931.57
3,39,93826.63
4,43,79084.1


In [10]:
col = 'Age'
X_min = nrm_df[col].min()
X_max = nrm_df[col].max()
X_max_min = X_max - X_min

X_min, X_max, X_max_min

(18, 92, 74)

In [11]:
col = 'EstimatedSalary'
Y_min = nrm_df[col].min()
Y_max = nrm_df[col].max()
Y_max_min = Y_max - Y_min

Y_min, Y_max, Y_max_min

(11.58, 199992.48, 199980.90000000002)

In [12]:
nrm_df['Nrm_Age'] = (nrm_df['Age'] - X_min)/X_max_min
nrm_df['Nrm_EstimatedSalary'] = (nrm_df['EstimatedSalary'] - Y_min)/Y_max_min
nrm_df.head()



Unnamed: 0,Age,EstimatedSalary,Nrm_Age,Nrm_EstimatedSalary
0,42,101348.88,0.324324,0.506735
1,41,112542.58,0.310811,0.562709
2,42,113931.57,0.324324,0.569654
3,39,93826.63,0.283784,0.46912
4,43,79084.1,0.337838,0.3954
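The displayed values can be checked by hand from the minimum and maximum printed earlier, and the same stored statistics can map a scaled value back to its original units. A small sketch in plain Python; the helper names `min_max` and `inverse_min_max` are illustrative.

```python
AGE_MIN, AGE_MAX = 18, 92            # from the In [10] output
SAL_MIN, SAL_MAX = 11.58, 199992.48  # from the In [11] output

def min_max(value, lo, hi):
    """Scale a single value to the 0-1 range given the feature's min and max."""
    return (value - lo) / (hi - lo)

def inverse_min_max(scaled, lo, hi):
    """Map a 0-1 scaled value back to the original units."""
    return scaled * (hi - lo) + lo

# First row of the table: Age 42, EstimatedSalary 101348.88
nrm_age = min_max(42, AGE_MIN, AGE_MAX)
nrm_salary = min_max(101348.88, SAL_MIN, SAL_MAX)
print(round(nrm_age, 6), round(nrm_salary, 6))

# Round trip: the scaled age maps back to 42
restored_age = inverse_min_max(nrm_age, AGE_MIN, AGE_MAX)
print(restored_age)
```

Keeping the minimum and maximum around is what makes the transformation reversible, and it also allows values outside the original table to be scaled consistently.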
