In [37]:
import pandas as pd
import sklearn

# Standardization in Data Analysis

Standardization, also known as z-score normalization, is a statistical method used to transform and scale the features (variables) of a dataset so that they have a mean of 0 and a standard deviation of 1. This process is commonly applied to features with different units or scales to make them more comparable.

The formula for standardization (z-score) of a variable \(X\) is given by:

$$ Z = \frac{X - \mu}{\sigma} $$

where \(\mu\) is the mean of \(X\) and \(\sigma\) is its standard deviation.
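
As a quick check of the formula, the following minimal sketch standardizes a small made-up sample using only the standard library. The values are illustrative and are not taken from the dataset used in this notebook. `pstdev` is the population standard deviation (dividing by N), which is the convention `StandardScaler` also follows.

```python
from statistics import fmean, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample, for illustration only

mu = fmean(values)      # mean of the sample
sigma = pstdev(values)  # population standard deviation (divides by N)

# Apply Z = (X - mu) / sigma to every value
z_scores = [(x - mu) / sigma for x in values]

print(z_scores)
print(fmean(z_scores), pstdev(z_scores))  # mean and standard deviation after scaling
```

After the transformation the sample has mean 0 and standard deviation 1, as the definition requires.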

The standardization process has several benefits, including:

1. **Comparability:** Standardizing variables ensures that they are on a similar scale, making it easier to compare them.
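2. **Machine Learning Algorithms:** Many machine learning algorithms, particularly distance- or margin-based ones such as k-means clustering and support vector machines, perform better when features are standardized, because otherwise the feature with the largest numeric range dominates the result.
3. **Gradient Descent Convergence:** Standardization can help gradient descent converge more quickly, since features on similar scales produce gradients of similar magnitude.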
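
To make the comparability point concrete, here is a rough sketch of how much the Age difference contributes to the squared Euclidean distance between two users, before and after scaling. The two users are the first two rows of the dataset loaded in this notebook, and the variances are copied from the training-set `scaler.var_` output printed further down. The helper `age_share` is a hypothetical name introduced only for this illustration.

```python
import math

# (Age, EstimatedSalary) of the first two rows of the dataset
user_a = (19, 19000)
user_b = (35, 20000)

# Per-feature standard deviations, taken from the printed scaler.var_ values
age_std = math.sqrt(1.04038724e+02)
salary_std = math.sqrt(1.19572709e+09)

def age_share(age_diff, salary_diff):
    """Fraction of the squared Euclidean distance contributed by the Age term."""
    return age_diff**2 / (age_diff**2 + salary_diff**2)

age_diff = user_b[0] - user_a[0]
salary_diff = user_b[1] - user_a[1]

# The mean cancels when taking differences, so scaling a difference
# only requires dividing by the standard deviation.
raw_share = age_share(age_diff, salary_diff)
scaled_share = age_share(age_diff / age_std, salary_diff / salary_std)

print(f"Age share of squared distance, raw:    {raw_share:.4%}")
print(f"Age share of squared distance, scaled: {scaled_share:.4%}")
```

On the raw values a 16-year age gap is almost invisible next to a salary gap of 1000, while after scaling it accounts for nearly all of the distance.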

In Python, you can use libraries like scikit-learn to standardize your data:

In [33]:
df=pd.read_csv(r'D:\Machine Learning\Machine-Learning\Data\Social_Network_Ads.csv')
df

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0
...,...,...,...,...,...
395,15691863,Female,46,41000,1
396,15706071,Male,51,23000,1
397,15654296,Female,50,20000,1
398,15755018,Male,36,33000,0


In [34]:
# keep Age, EstimatedSalary and Purchased; drop the User ID and Gender columns
df = df.iloc[:, 2:]
df.sample(5)


Unnamed: 0,Age,EstimatedSalary,Purchased
68,22,63000,0
234,38,112000,0
292,55,39000,1
348,39,77000,0
162,37,33000,0


In [35]:
from sklearn.model_selection import train_test_split
# hold out 30% of the rows for testing; random_state=0 makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    df.drop('Purchased', axis=1), df['Purchased'], test_size=0.3, random_state=0)

x_train.shape,x_test.shape

((280, 2), (120, 2))

In [38]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit the scaler on the training set only; it learns each feature's mean and variance
scaler.fit(x_train)

# transform both sets using the parameters learned from the training set
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
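
Note that `scaler.transform` returns a plain NumPy array, so the column names and row index of the input DataFrame are lost. One way to restore them is sketched below. The frame `x_demo` is a small stand-in built from the first four rows of the printed `x_train`, and the names `x_demo`, `scaler_demo` and `x_demo_scaled` are illustrative, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small stand-in for x_train, with a non-default index as after a random split
x_demo = pd.DataFrame(
    {"Age": [26, 60, 38, 40], "EstimatedSalary": [15000, 102000, 112000, 107000]},
    index=[92, 223, 234, 232],
)

scaler_demo = StandardScaler().fit(x_demo)
scaled_array = scaler_demo.transform(x_demo)  # plain NumPy array, labels lost

# Re-attach the original column names and row index
x_demo_scaled = pd.DataFrame(scaled_array, columns=x_demo.columns, index=x_demo.index)
print(x_demo_scaled)
```

Passing `index=` matters here because the rows were shuffled by the split; without it the scaled frame would be renumbered from 0 and no longer line up with `y_train`.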

In [43]:
# per-feature parameters learned from the training set, in column order (Age, EstimatedSalary)
print(scaler.mean_)
print(scaler.var_)

[3.78642857e+01 6.98071429e+04]
[1.04038724e+02 1.19572709e+09]
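
Taking square roots of the printed variances gives standard deviations of roughly 10.2 years for Age and about 34,579 for EstimatedSalary. The sketch below checks, on a small made-up matrix, that `mean_` and `var_` are the ordinary column mean and population variance and that `transform` applies the z-score formula. `X`, `scaler_check` and `manual` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix; the columns play the role of Age and EstimatedSalary
X = np.array(
    [[26, 15000], [60, 102000], [38, 112000], [40, 107000], [42, 53000]],
    dtype=float,
)

scaler_check = StandardScaler().fit(X)

# mean_ and var_ match NumPy's column mean and population variance (ddof=0)
assert np.allclose(scaler_check.mean_, X.mean(axis=0))
assert np.allclose(scaler_check.var_, X.var(axis=0))

# transform applies (X - mean) / sqrt(var) column by column
manual = (X - scaler_check.mean_) / np.sqrt(scaler_check.var_)
print(np.allclose(manual, scaler_check.transform(X)))
```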


In [47]:
print(x_train)
# scaler.transform returns a NumPy array, so the column names and index are lost;
# wrap the result in a DataFrame if the labels are needed
print(x_train_scaled)

     Age  EstimatedSalary
92    26            15000
223   60           102000
234   38           112000
232   40           107000
377   42            53000
..   ...              ...
323   48            30000
192   29            43000
117   36            52000
47    27            54000
172   26           118000

[280 rows x 2 columns]
[[-1.1631724  -1.5849703 ]
 [ 2.17018137  0.93098672]
 [ 0.0133054   1.22017719]
 [ 0.20938504  1.07558195]
 [ 0.40546467 -0.48604654]
 [-0.28081405 -0.31253226]
 [ 0.99370357 -0.8330751 ]
 [ 0.99370357  1.8563962 ]
 [ 0.0133054   1.24909623]
 [-0.86905295  2.26126285]
 [-1.1631724  -1.5849703 ]
 [ 2.17018137 -0.80415605]
 [-1.35925203 -1.46929411]
 [ 0.40546467  2.2901819 ]
 [ 0.79762394  0.75747245]
 [-0.96709276 -0.31253226]
 [ 0.11134522  0.75747245]
 [-0.96709276  0.55503912]
 [ 0.30742485  0.06341534]
 [ 0.69958412 -1.26686079]
 [-0.47689368 -0.0233418 ]
 [-1.7514113   0.3526058 ]
 [-0.67297331  0.12125343]
 [ 0.40546467  0.29476771]
 [-0.28081405  0