### Data normalization

In [3]:
import pandas as pd 

data = pd.read_csv('../data/outlier.csv')
data.head()

,x_coord,y_coord
0,-5.577854,5.872988
1,1.627832,4.178069
2,-6.371844,4.419223
3,1.750055,5.445829
4,6.550104,-7.912339


Let's normalize the data in the ``x_coord`` column only.

In [12]:
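# Select the column to normalize
dfx = data['x_coord']

dfx_min = dfx.min()
dfx_max = dfx.max()

print(f"min value: {dfx_min}  max value: {dfx_max}")

# Min-max normalization: rescale the values linearly to the [0, 1] range
dfx_normalized = (dfx - dfx_min) / (dfx_max - dfx_min)
dfx_normalized.head()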


min value: -8.333846033770143  max value: 15.0


0    0.118111
1    0.426920
2    0.084084
3    0.432158
4    0.637870
Name: x_coord, dtype: float64
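
To make the formula concrete, the first normalized value can be recomputed by hand. This is only a small sketch: the three numbers are copied from the outputs above (the `x_coord` of row 0 is the rounded value that `head()` displays), and the names `x`, `x_min`, `x_max` and `x_scaled` are introduced here for illustration.

```python
# Values copied from the outputs above: x_coord of row 0 (as displayed),
# and the column minimum and maximum printed by the cell.
x = -5.577854
x_min = -8.333846033770143
x_max = 15.0

# Min-max formula: shift by the minimum, then divide by the range.
x_scaled = (x - x_min) / (x_max - x_min)

print(round(x_scaled, 6))
```

The result should agree with the value shown for row 0 of `dfx_normalized`, up to the rounding of the displayed input.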

In [13]:
dfx_normalized.describe()

count    103.000000
mean       0.386283
std        0.198376
min        0.000000
25%        0.189724
50%        0.426878
75%        0.544678
max        1.000000
Name: x_coord, dtype: float64
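
The same steps could be wrapped in a small helper so they can be reused for other columns. The function below is a sketch, not part of the original notebook: the name `min_max_normalize` is hypothetical, the example Series is made up, and mapping a constant column to zeros (instead of dividing by zero) is an assumption chosen for this illustration.

```python
import pandas as pd


def min_max_normalize(values: pd.Series) -> pd.Series:
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    v_min = values.min()
    v_max = values.max()
    value_range = v_max - v_min
    if value_range == 0:
        # Assumption: a constant column is mapped to all zeros rather than NaN.
        return pd.Series(0.0, index=values.index, name=values.name)
    return (values - v_min) / value_range


# Made-up example data, not taken from outlier.csv.
example = pd.Series([2.0, 4.0, 10.0], name="x_coord")
scaled = min_max_normalize(example)
print(scaled.tolist())
```

With the notebook's data, `min_max_normalize(data['x_coord'])` would be expected to reproduce `dfx_normalized`.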

Let's normalize the data in both columns together. Each column is normalized with respect to its own minimum and maximum, so the values in each column fall in the [0, 1] range.

In [16]:
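df = data.copy()

# Column-wise minimum and maximum (each is a Series indexed by column name)
df_min = df.min()
df_max = df.max()

print(f"min value: {df_min}  max value: {df_max}")

# Arithmetic aligns on column labels, so each column uses its own min and max
df_normalized = (df - df_min) / (df_max - df_min)

df_normalized.head()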

min value: x_coord    -8.333846
y_coord   -11.328333
dtype: float64  max value: x_coord    15.0
y_coord    10.0
dtype: float64


,x_coord,y_coord
0,0.118111,0.806501
1,0.42692,0.727033
2,0.084084,0.73834
3,0.432158,0.786473
4,0.63787,0.160162
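
The column-wise behaviour comes from how pandas aligns a DataFrame with a Series: the Series returned by `min()` and `max()` is indexed by column name, and the arithmetic matches those labels against the DataFrame's columns. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration, not taken from `outlier.csv`):

```python
import pandas as pd

# Toy data, made up for illustration: two columns on very different scales.
toy = pd.DataFrame({"x_coord": [0.0, 5.0, 10.0], "y_coord": [-100.0, 0.0, 300.0]})

col_min = toy.min()  # Series indexed by column name
col_max = toy.max()

# Subtraction and division align on column labels, so each column
# is scaled by its own minimum and range.
toy_normalized = (toy - col_min) / (col_max - col_min)

print(toy_normalized)
```

Each column ends up spanning 0 to 1 independently, even though the original ranges differ by more than an order of magnitude.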


In [17]:
df.describe()

,x_coord,y_coord
count,103.0,103.0
mean,0.679627,-1.212853
std,4.62888,6.479785
min,-8.333846,-11.328333
25%,-3.906847,-7.161316
50%,1.626857,-2.5
75%,4.375579,5.018325
max,15.0,10.0


In [18]:
df_normalized.describe()

,x_coord,y_coord
count,103.0,103.0
mean,0.386283,0.474274
std,0.198376,0.303811
min,0.0,0.0
25%,0.189724,0.195375
50%,0.426878,0.413925
75%,0.544678,0.766429
max,1.0,1.0
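
Comparing the two summaries, normalization changes only the scale of each column: `count` is unchanged, and `min` and `max` become 0 and 1. Because the transform is linear, it can be undone as long as the original minimum and maximum are kept. The sketch below shows this round trip on made-up data; the `toy` frame and the name `toy_restored` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"x_coord": [-8.0, 1.5, 15.0], "y_coord": [-11.0, -2.5, 10.0]})

col_min = toy.min()
col_max = toy.max()
toy_normalized = (toy - col_min) / (col_max - col_min)

# Invert the scaling: multiply by the range, then add the minimum back.
toy_restored = toy_normalized * (col_max - col_min) + col_min

# Compare with a tolerance, since floating-point round trips are not exact.
print(np.allclose(toy_restored, toy))
```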
