In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("./formatted_auto_data.csv")
df.head()
df.dtypes

symboling              int64
normalized-losses    float64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-L/100km         float64
highway-mpg            int64
price                  int64
dtype: object

There are three methods for normalizing data (putting values on the same scale):
1. Simple feature scaling: X' = X / max(X)
2. Min-Max normalization: X' = (X - min(X)) / (max(X) - min(X))
3. Z-score normalization: X' = (X - mean(X)) / std(X) (the standard score: values are centered on the mean and scaled by the standard deviation)

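The resulting range depends on the method. Min-max normalization always maps values to [0, 1]. Simple feature scaling gives a maximum of 1, and stays within [0, 1] only when the data is non-negative. Z-scores are centered on 0, so they can be negative or greater than 1.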
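
To compare the three formulas side by side, here is a minimal sketch on a small toy series. The values and the helper names (`simple_feature_scaling`, `min_max`, `z_score`) are assumptions made for illustration and are not part of the notebook or the dataset.

```python
import pandas as pd

# Hypothetical toy values, not taken from the auto dataset.
x = pd.Series([150.0, 160.0, 170.0, 200.0])

def simple_feature_scaling(s: pd.Series) -> pd.Series:
    """X' = X / max(X)"""
    return s / s.max()

def min_max(s: pd.Series) -> pd.Series:
    """X' = (X - min(X)) / (max(X) - min(X))"""
    return (s - s.min()) / (s.max() - s.min())

def z_score(s: pd.Series) -> pd.Series:
    """X' = (X - mean(X)) / std(X), using pandas' default sample std."""
    return (s - s.mean()) / s.std()

scaled = simple_feature_scaling(x)
mm = min_max(x)
z = z_score(x)

# Print the observed range of each result to compare the methods.
for name, result in [("simple", scaled), ("min-max", mm), ("z-score", z)]:
    print(name, result.min(), result.max())
```

On this toy series, only the min-max result spans exactly 0 to 1, and the z-scores include negative values.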

In [4]:
df["length"]


0      168.8
1      171.2
2      176.6
3      176.6
4      177.3
       ...  
199    188.8
200    188.8
201    188.8
202    188.8
203    188.8
Name: length, Length: 204, dtype: float64

Let's apply each of the three methods to the `length` column.

### Simple feature scaling

In [6]:
df["length"] = df["length"]/df["length"].max()
df["length"]

0      0.811148
1      0.822681
2      0.848630
3      0.848630
4      0.851994
         ...   
199    0.907256
200    0.907256
201    0.907256
202    0.907256
203    0.907256
Name: length, Length: 204, dtype: float64

### Min-Max normalization

In [7]:
# first retrieve the original length column
df = pd.read_csv("./formatted_auto_data.csv")

# applying min-max normalization
df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())
df["length"]

0      0.413433
1      0.449254
2      0.529851
3      0.529851
4      0.540299
         ...   
199    0.711940
200    0.711940
201    0.711940
202    0.711940
203    0.711940
Name: length, Length: 204, dtype: float64

### Z-score normalization

In [8]:
# reload the data to retrieve the original length column
df = pd.read_csv("./formatted_auto_data.csv")

# applying z-score normalization
df["length"] = (df["length"] - df["length"].mean()) / df["length"].std()
df["length"]

0     -0.426707
1     -0.232565
2      0.204253
3      0.204253
4      0.260878
         ...   
199    1.191138
200    1.191138
201    1.191138
202    1.191138
203    1.191138
Name: length, Length: 204, dtype: float64
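
One detail worth checking is which standard deviation is used: `Series.std()` in pandas defaults to the sample standard deviation (`ddof=1`), whereas `numpy.std` defaults to the population version (`ddof=0`). The sketch below, on hypothetical toy values rather than the dataset, verifies that the z-scored result has mean 0 and standard deviation 1 under the matching convention.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the auto dataset.
x = pd.Series([150.0, 160.0, 170.0, 200.0])

# pandas default: sample standard deviation (ddof=1), as in the cell above.
z_sample = (x - x.mean()) / x.std()

# Population standard deviation (ddof=0), matching numpy's default.
z_population = (x - x.mean()) / x.std(ddof=0)

print(z_sample.mean(), z_sample.std())        # mean ~0, sample std ~1
print(z_population.std(ddof=0))               # population std ~1
print(np.isclose(np.std(x.to_numpy()), x.std(ddof=0)))
```

For a column with a couple of hundred rows, the two conventions give nearly the same z-scores, so the choice mostly matters for consistency with other tools.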

In [9]:
# saving the data; at this point `length` holds the z-score normalized values
df.to_csv("./normalized_auto_data.csv", index=False)
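
Because each method above overwrites `length`, the CSV has to be re-read before trying the next one. A possible alternative, sketched here under assumptions, is to write each result to its own column so the original values are kept. The column names `length_scaled`, `length_minmax`, and `length_zscore` are hypothetical, the small DataFrame is a stand-in for the loaded data, and the output goes to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Stand-in for the DataFrame loaded from the CSV (only a few illustrative rows).
df = pd.DataFrame({"length": [168.8, 171.2, 176.6, 188.8]})

length = df["length"]

# Store each normalization in a new, hypothetical column; `length` is untouched.
df["length_scaled"] = length / length.max()
df["length_minmax"] = (length - length.min()) / (length.max() - length.min())
df["length_zscore"] = (length - length.mean()) / length.std()

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])
```

This keeps all three variants available for comparison in a single saved file.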