### Data standardization

In [2]:
import pandas as pd 

data = pd.read_csv('../data/outlier.csv')
data.head()

    x_coord   y_coord
0 -5.577854  5.872988
1  1.627832  4.178069
2 -6.371844  4.419223
3  1.750055  5.445829
4  6.550104 -7.912339


Let's standardize the data in the ``x_coord`` column only.
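
As a reference for the computation in the next cell, standardization (the z-score) can be written as follows. The symbols $x_i$, $\bar{x}$, $s$, $z_i$ and $n$ are notation introduced here, not names from the notebook, and $s$ is assumed to be the sample standard deviation, which is what pandas' `std()` returns by default (`ddof=1`).

$$
z_i = \frac{x_i - \bar{x}}{s}, \qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$

For the first row, with the mean and standard deviation printed by the cell, this gives $z_0 = (-5.577854 - 0.679627)/4.628880 \approx -1.351835$.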

In [4]:
dfx = data['x_coord']

dfx_mean = dfx.mean()
dfx_std = dfx.std()

print(f"mean value: {dfx_mean}  std value: {dfx_std}")

dfx_standardized = (dfx-dfx_mean)/dfx_std
dfx_standardized.head()


mean value: 0.6796271541090921  std value: 4.628880082934205


0   -1.351835
1    0.204845
2   -1.523364
3    0.231250
4    1.268228
Name: x_coord, dtype: float64

In [5]:
dfx.describe()

count    103.000000
mean       0.679627
std        4.628880
min       -8.333846
25%       -3.906847
50%        1.626857
75%        4.375579
max       15.000000
Name: x_coord, dtype: float64

In [6]:
dfx_standardized.describe()

count    1.030000e+02
mean     1.509041e-17
std      1.000000e+00
min     -1.947225e+00
25%     -9.908387e-01
50%      2.046347e-01
75%      7.984549e-01
max      3.093701e+00
Name: x_coord, dtype: float64
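
The mean of about `1.5e-17` above is zero up to floating-point rounding. A minimal sketch of how that could be checked with a tolerance is shown below; it uses hypothetical toy values rather than the contents of `outlier.csv`, and `standardize` is a helper name introduced only for this illustration. It also shows the `ddof` choice: pandas divides by `n - 1` by default, and the standardized values have unit standard deviation only when measured with the same `ddof` that was used to scale them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not the contents of outlier.csv
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], name="x")


def standardize(s: pd.Series, ddof: int = 1) -> pd.Series:
    """Return (s - mean) / std, with std computed using the given ddof."""
    return (s - s.mean()) / s.std(ddof=ddof)


z_sample = standardize(x)              # pandas default: sample std (ddof=1)
z_population = standardize(x, ddof=0)  # population std (ddof=0)

# Compare with a tolerance instead of testing for exact equality
print(np.isclose(z_sample.mean(), 0.0))
print(np.isclose(z_sample.std(), 1.0))
print(np.isclose(z_population.std(ddof=0), 1.0))
```

Which `ddof` to use is a convention; the important part is to apply the same one consistently when comparing results across tools.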

Let's standardize both columns together. Each column is standardized with respect to its own mean and standard deviation, so every standardized column has zero mean and unit standard deviation.

In [9]:
df = data.copy()

df_mean = df.mean()
df_std = df.std() 

print(f"mean value: {df_mean}  std value: {df_std}")

df_standardized = (df-df_mean)/df_std

df_standardized.head()

mean value: x_coord    0.679627
y_coord   -1.212853
dtype: float64  std value: x_coord    4.628880
y_coord    6.479785
dtype: float64


    x_coord   y_coord
0 -1.351835  1.093530
1  0.204845  0.831960
2 -1.523364  0.869176
3  0.231250  1.027608
4  1.268228 -1.033906
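
The expression `(df-df_mean)/df_std` works column by column because pandas aligns the index of a Series with the column labels of a DataFrame. The sketch below illustrates this on a hypothetical toy frame (the names `toy`, `vectorized` and `looped` are introduced here, and the values are not from `outlier.csv`) by comparing the vectorized form with an explicit per-column loop.

```python
import pandas as pd

# Hypothetical toy frame, not the contents of outlier.csv
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

# Per-column statistics: each is a Series indexed by column name
col_mean = toy.mean()
col_std = toy.std()  # sample standard deviation (ddof=1)

# DataFrame - Series aligns the Series index with the column labels,
# so each column is shifted and scaled by its own statistics
vectorized = (toy - col_mean) / col_std

# The same computation written out one column at a time
looped = pd.DataFrame(
    {c: (toy[c] - toy[c].mean()) / toy[c].std() for c in toy.columns}
)

# Raises an AssertionError if the two results differ beyond tolerance
pd.testing.assert_frame_equal(vectorized, looped)
print(vectorized)
```

The vectorized form is shorter and avoids a Python-level loop, which is why it is usually preferred for frames with many columns.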


In [8]:
df.describe()

          x_coord     y_coord
count  103.000000  103.000000
mean     0.679627   -1.212853
std      4.628880    6.479785
min     -8.333846  -11.328333
25%     -3.906847   -7.161316
50%      1.626857   -2.500000
75%      4.375579    5.018325
max     15.000000   10.000000


In [10]:
df_standardized.describe()

            x_coord       y_coord
count  1.030000e+02  1.030000e+02
mean   1.509041e-17  4.742700e-17
std    1.000000e+00  1.000000e+00
min   -1.947225e+00 -1.561083e+00
25%   -9.908387e-01 -9.180032e-01
50%    2.046347e-01 -1.986405e-01
75%    7.984549e-01  9.616335e-01
max    3.093701e+00  1.730436e+00
