# PART 3 FEATURE SCALING

Goal: bring the features onto comparable scales before they enter the model
When to use: before training most models (tree-based models are the main exception)

Why does it matter?

**Without scaling:**
- salary: 5,000,000 (millions)
- age: 25 (tens)

-> the model becomes biased toward the feature with the larger values (salary)

**With scaling:**
- salary: 0.5 (after scaling)
- age: 0.6 (after scaling)

-> every feature contributes on a comparable scale
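
To make the bias concrete, here is a minimal sketch of how a distance-based model would see two people before any scaling. The two rows are hypothetical values chosen for illustration, not taken from the dataset.

```python
import numpy as np

# Two hypothetical people described as [age, salary]
a = np.array([25.0, 5_000_000.0])
b = np.array([55.0, 5_100_000.0])

diff = a - b
distance = np.sqrt(np.sum(diff ** 2))

# Fraction of the squared Euclidean distance that comes from salary alone
salary_share = diff[1] ** 2 / np.sum(diff ** 2)

print(f"Euclidean distance: {distance:,.1f}")
print(f"Share of squared distance from salary: {salary_share:.8f}")
```

Even though the ages differ by 30 years, almost all of the distance comes from salary, which is why KNN, K-Means, and SVM are sensitive to unscaled features.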


Algorithms that need scaling:
- Linear Regression (mainly when regularized or trained with gradient descent)
- Logistic Regression
- SVM (Support Vector Machine)
- KNN (K-Nearest Neighbors)
- Neural Network
- PCA (Principal Component Analysis)
- K-Means Clustering

Algorithms that do not need scaling:
- Decision Trees
- Random Forest
- XGBoost, LightGBM
- Naive Bayes
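
The claim about tree-based models can be checked with a small sketch. The data below is synthetic and hypothetical (the label depends only on age), and `DecisionTreeClassifier` stands in for the tree family in general.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: the label depends only on age (> 30)
age = rng.integers(20, 60, size=200).astype(float)
salary = rng.uniform(4e7, 7e8, size=200)
X = np.column_stack([age, salary])
y = (age > 30).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Train the same tree on raw and on standardized features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

same_predictions = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("Tree predictions identical with and without scaling:", same_predictions)
```

A tree only asks whether a feature is above or below a threshold, and standardization preserves the ordering of values, so the same splits remain available.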

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

print("=" * 60)
print("FEATURE SCALING")
print("=" * 60)

FEATURE SCALING


### CREATE DUMMY DATA

In [9]:
print("\n[Setup] Dataset with features on different scales\n")

data = {
    'Nama': ['ANDI', 'QUINCY', 'MAVERICK', 'JIWOO', 'CARMEN', 'MAX', 'VERSTAPPEN', 'MARQUEZ'],
    'Umur': [24, 27, 22, 30, 28, 24, 35, 22],  # age, scale: 20-35
    # salary, scale: tens to hundreds of millions
    'Gaji': [50000000, 60000000, 45000000, 80000000, 665000000, 55000000, 120000000, 40000000],
    'Tahun_Kerja': [2, 5, 1, 8, 4, 2, 10, 1],  # years of work, scale: 1-10
    # performance rating, scale: 0-5 (VERSTAPPEN's rating is 0.0)
    'Rating_Kinerja': [3.5, 4.2, 3.0, 4.8, 3.8, 5.0, 0.0, 3.2]
}

df = pd.DataFrame(data)

print("Original data:")
print(df)

print("\nDescriptive statistics:")
print(df.describe())


[Setup] Dataset with features on different scales

Original data:
         Nama  Umur       Gaji  Tahun_Kerja  Rating_Kinerja
0        ANDI    24   50000000            2             3.5
1      QUINCY    27   60000000            5             4.2
2    MAVERICK    22   45000000            1             3.0
3       JIWOO    30   80000000            8             4.8
4      CARMEN    28  665000000            4             3.8
5         MAX    24   55000000            2             5.0
6  VERSTAPPEN    35  120000000           10             0.0
7     MARQUEZ    22   40000000            1             3.2

Descriptive statistics:
            Umur          Gaji  Tahun_Kerja  Rating_Kinerja
count   8.000000  8.000000e+00     8.000000        8.000000
mean   26.500000  1.393750e+08     4.125000        3.437500
std     4.472136  2.139332e+08     3.356763        1.561993
min    22.000000  4.000000e+07     1.000000        0.000000
25%    23.500000  4.875000e+07     1.750000        3.150000
50%    25.5

## Method 1: Standardization (z-score normalization)

Formula: X_scaled = (X - mean) / std_dev

Result:
- Mean = 0
- Standard deviation = 1
- Range: usually between -3 and +3 for roughly normal data

When to use standardization:
- the data is normally distributed (or close to it)
- Algorithms: Linear Regression, Logistic Regression, SVM, Neural Networks
- there are outliers, but they are not too extreme

Not suitable:
- when the data has too many extreme outliers (use RobustScaler instead)
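
As a sanity check on the formula, the following sketch computes z-scores by hand and compares them with `StandardScaler`. It reuses the eight `Umur` values as a plain array; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ages from the dummy dataset, as a single feature
x = np.array([24.0, 27.0, 22.0, 30.0, 28.0, 24.0, 35.0, 22.0])

# Manual z-score using the population standard deviation (ddof=0)
z_manual = (x - x.mean()) / x.std(ddof=0)

# StandardScaler expects a 2-D array of shape (n_samples, n_features)
z_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print("Manual result matches StandardScaler:", np.allclose(z_manual, z_sklearn))

# The sample std (ddof=1), which pandas uses by default, is sqrt(n / (n - 1))
# for standardized data, so it is slightly above 1 on small samples
print("Sample std of the z-scores:", z_manual.std(ddof=1))
```

The last line explains why `describe()` on standardized data may report a std a little larger than 1: `StandardScaler` divides by the population std, while pandas reports the sample std.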



In [10]:
# select the numeric columns (everything except Nama)
numeric_cols = ['Umur', 'Gaji', 'Tahun_Kerja', 'Rating_Kinerja']

# create the scaler
scaler_standard = StandardScaler()

# fit and transform
# IMPORTANT: in a real workflow, fit on the training set only, then transform train & test;
# the whole DataFrame is used here purely to demonstrate the transformation
df_standarized = df.copy()
df_standarized[numeric_cols] = scaler_standard.fit_transform(df[numeric_cols])

print("\nStandardization result:")
print(df_standarized)

# StandardScaler divides by the population std (ddof=0), while describe() reports
# the sample std (ddof=1), so std shows sqrt(8/7) = 1.069 here instead of exactly 1
print("\nVerify mean & std (mean should be ~0, std close to 1):")
print(df_standarized[numeric_cols].describe())


Standardization result:
         Nama      Umur      Gaji  Tahun_Kerja  Rating_Kinerja
0        ANDI -0.597614 -0.446616    -0.676759        0.042776
1      QUINCY  0.119523 -0.396645     0.278666        0.521863
2    MAVERICK -1.075706 -0.471601    -0.995234       -0.299430
3       JIWOO  0.836660 -0.296703     1.234091        0.932510
4      CARMEN  0.358569  2.626599    -0.039809        0.248099
5         MAX -0.597614 -0.421630    -0.676759        1.069392
6  VERSTAPPEN  2.031889 -0.096819     1.871040       -2.352663
7     MARQUEZ -1.075706 -0.496587    -0.995234       -0.162548

Verify mean & std (mean should be ~0, std close to 1):
               Umur          Gaji   Tahun_Kerja  Rating_Kinerja
count  8.000000e+00  8.000000e+00  8.000000e+00        8.000000
mean   4.163336e-17 -2.775558e-17  6.938894e-18        0.000000
std    1.069045e+00  1.069045e+00  1.069045e+00        1.069045
min   -1.075706e+00 -4.965865e-01 -9.952343e-01       -2.352663
25%   -7.171372e-01 -4.528619e-01

## MIN-MAX NORMALIZATION

Formula: X_scaled = (X - X_min) / (X_max - X_min)

Result:
- Range: 0 - 1 (can be changed to another range)

When to use:
- a fixed [0, 1] range is needed (e.g. image pixels, probabilities)
- Algorithms: Neural Networks (especially with sigmoid/tanh activations)
- the data has no extreme outliers
- K-Means Clustering

Not suitable:
- data with many outliers (the outliers "compress" the other values)
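
The compression effect is easy to see on a tiny example. This sketch applies the formula by hand to hypothetical salaries (in millions) with one extreme value and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical salaries (in millions) with one extreme value
x = np.array([40.0, 45.0, 50.0, 55.0, 60.0, 665.0])

# Manual min-max formula
x_manual = (x - x.min()) / (x.max() - x.min())

# MinMaxScaler expects a 2-D array of shape (n_samples, n_features)
x_sklearn = MinMaxScaler().fit_transform(x.reshape(-1, 1)).ravel()

print("Manual result matches MinMaxScaler:", np.allclose(x_manual, x_sklearn))
print(x_manual.round(3))
```

All the ordinary salaries end up below 0.05 because the single extreme value defines the top of the range.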

In [13]:
# create the Min-Max scaler
scaler_minmax = MinMaxScaler()

df_normalized = df.copy()
df_normalized[numeric_cols] = scaler_minmax.fit_transform(df[numeric_cols])

print("\nMin-Max normalization result:")
print(df_normalized)

print("\nVerify range:")
print(df_normalized[numeric_cols].describe())

# custom range (for example 0 - 10)
print("\nBonus: custom range [0, 10]")
scaler_custom = MinMaxScaler(feature_range=(0, 10))
df_custom = df.copy()
df_custom[numeric_cols] = scaler_custom.fit_transform(df[numeric_cols])
print(df_custom[['Nama', 'Umur', 'Gaji']])


Min-Max normalization result:
         Nama      Umur   Gaji  Tahun_Kerja  Rating_Kinerja
0        ANDI  0.153846  0.016     0.111111            0.70
1      QUINCY  0.384615  0.032     0.444444            0.84
2    MAVERICK  0.000000  0.008     0.000000            0.60
3       JIWOO  0.615385  0.064     0.777778            0.96
4      CARMEN  0.461538  1.000     0.333333            0.76
5         MAX  0.153846  0.024     0.111111            1.00
6  VERSTAPPEN  1.000000  0.128     1.000000            0.00
7     MARQUEZ  0.000000  0.000     0.000000            0.64

Verify range:
           Umur      Gaji  Tahun_Kerja  Rating_Kinerja
count  8.000000  8.000000     8.000000        8.000000
mean   0.346154  0.159000     0.347222        0.687500
std    0.344010  0.342293     0.372974        0.312399
min    0.000000  0.000000     0.000000        0.000000
25%    0.115385  0.014000     0.083333        0.630000
50%    0.269231  0.028000     0.222222        0.730000
75%    0.500000  0.0

## ROBUST SCALING

Formula: X_scaled = (X - median) / IQR
IQR = Q3 - Q1 (interquartile range)

Result:
- far less sensitive to outliers (uses median and IQR rather than mean & std)
- Range: not fixed, depends on the data

When to use:
- data with many outliers
- you want robustness against extreme values
- StandardScaler has been tried but gave poor results

Good fit for:
- financial data (outliers are common)
- sensor data (errors / spikes can occur)
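
The formula can be verified by hand in the same way. This sketch assumes `RobustScaler` with its default settings (centering on the median, scaling by the 25th to 75th percentile range) and uses hypothetical salaries in millions.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical salaries (in millions), including two extreme values
x = np.array([40.0, 45.0, 50.0, 55.0, 60.0, 80.0, 120.0, 500.0, 665.0])

# Manual robust scaling: center on the median, divide by the IQR
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
x_manual = (x - median) / (q3 - q1)

# RobustScaler expects a 2-D array of shape (n_samples, n_features)
x_sklearn = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()

print("Manual result matches RobustScaler:", np.allclose(x_manual, x_sklearn))
print(x_manual.round(3))
```

Because the median and quartiles depend on the middle of the data, the two extreme salaries are mapped to large values instead of stretching the scale for everyone else.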

In [19]:
# add an outlier row to the data

df_outlier = df.copy()
df_outlier.loc[len(df_outlier)] = ['Outlier', 50, 500000000, 20, 5.0]

print("\nData with outlier:")
print(df_outlier[['Nama', 'Umur', 'Gaji', 'Tahun_Kerja']])

# compare StandardScaler with RobustScaler
scaler_std = StandardScaler()
scaler_robust = RobustScaler()

df_std_with_outlier = df_outlier.copy()
df_robust_with_outlier = df_outlier.copy()

df_std_with_outlier[numeric_cols] = scaler_std.fit_transform(df_outlier[numeric_cols])
df_robust_with_outlier[numeric_cols] = scaler_robust.fit_transform(df_outlier[numeric_cols])

print("\nStandardScaler result (with outlier):")
print(df_std_with_outlier[['Nama', 'Gaji']].tail(3))
print("\n-> outliers inflate the mean and std, so typical salaries are squeezed into a narrow negative range")

print("\nRobustScaler result (with outlier):")
print(df_robust_with_outlier[['Nama', 'Gaji']].tail(3))
print("-> the outlier is clearly separated from the typical values")


Data with outlier:
         Nama  Umur       Gaji  Tahun_Kerja
0        ANDI    24   50000000            2
1      QUINCY    27   60000000            5
2    MAVERICK    22   45000000            1
3       JIWOO    30   80000000            8
4      CARMEN    28  665000000            4
5         MAX    24   55000000            2
6  VERSTAPPEN    35  120000000           10
7     MARQUEZ    22   40000000            1
8     Outlier    50  500000000           20

StandardScaler result (with outlier):
         Nama      Gaji
6  VERSTAPPEN -0.270087
7     MARQUEZ -0.633568
8     Outlier  1.456449

-> outliers inflate the mean and std, so typical salaries are squeezed into a narrow negative range

RobustScaler result (with outlier):
         Nama      Gaji
6  VERSTAPPEN  0.857143
7     MARQUEZ -0.285714
8     Outlier  6.285714
-> the outlier is clearly separated from the typical values


# Comparison of all methods

In [22]:
# take one data point to compare
sample = df.loc[df['Nama'] == 'JIWOO', 'Gaji'].iloc[0]

# compute manually

mean_gaji = df['Gaji'].mean()
# population std (ddof=0) to match StandardScaler; pandas defaults to ddof=1
std_gaji = df['Gaji'].std(ddof=0)
min_gaji = df['Gaji'].min()
max_gaji = df['Gaji'].max()
median_gaji = df['Gaji'].median()
q1_gaji = df['Gaji'].quantile(0.25)
q3_gaji = df['Gaji'].quantile(0.75)

standardized = (sample - mean_gaji) / std_gaji
normalized = (sample - min_gaji) / (max_gaji - min_gaji)
robust = (sample - median_gaji) / (q3_gaji - q1_gaji)

print(f"\nExample: Gaji JIWOO = Rp {sample:,.0f}")
print("=" * 60)
print(f"StandardScaler : {standardized:.3f}")
print(f"MinMaxScaler : {normalized:.3f}")
print(f"RobustScaler : {robust:.3f}")


Example: Gaji JIWOO = Rp 80,000,000
============================================================
StandardScaler : -0.297
MinMaxScaler : 0.064
RobustScaler : 0.545


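# Best practice: train-test split

A fatal mistake that happens often:
- Wrong: fitting the scaler on all of the data (train + test)
- Right: fit the scaler on the training set only, then transform both train and test

Why? To avoid DATA LEAKAGE: statistics computed with test rows (mean, std, min, max) carry information about the test set into training.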
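
One way to make the fit-on-train-only rule automatic is to wrap the scaler and the model in a scikit-learn pipeline, so that cross-validation refits the scaler on each training fold. The sketch below is an illustration under assumptions: the data is synthetic and `Ridge` is an arbitrary choice of model, not something this notebook uses elsewhere.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical data: 100 rows, 3 features on very different scales
X = rng.normal(size=(100, 3)) * np.array([1.0, 1e3, 1e6])
y = X[:, 0] + X[:, 1] / 1e3 + rng.normal(scale=0.1, size=100)

# The pipeline fits StandardScaler on the training part of each fold only,
# then applies the learned mean and std to the held-out part
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5)

print("R^2 per fold:", scores.round(3))
```

With this setup there is no separate `fit` call on the full dataset that could leak test statistics.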

In [33]:
# simulate a train-test split
from sklearn.model_selection import train_test_split

# separate features and target (for example, predicting salary)
X = df[['Umur', 'Tahun_Kerja', 'Rating_Kinerja']]
y = df['Gaji']

# split: 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTrain data: {len(X_train)} rows")
print(f"Test data: {len(X_test)} rows")

# THE RIGHT WAY
print("\nTHE RIGHT WAY:")
scaler = StandardScaler()

# 1. FIT ONLY on the training set
scaler.fit(X_train)

# 2. TRANSFORM both train and test
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("     1. Fit the scaler on the training set")
print("     2. Transform the training set")
print("     3. Transform the test set (using the parameters learned from training)")

print("\nX_train_scaled (sample):")
print(pd.DataFrame(X_train_scaled, columns=X.columns).head(3))

# the wrong way
print("\n❌ THE WRONG WAY (do not copy this!)")
print("   scaler.fit(pd.concat([X_train, X_test]))  # WRONG!")
print("   → This causes DATA LEAKAGE!")

# ============================================================================
# INVERSE TRANSFORM (back to the original scale)
# ============================================================================
print()
print("Inverse transform")
print("returns scaled data to its original values")

# example: show a scaled row in its original units again
sample_scaled = X_train_scaled[0:1]  # take 1 sample
sample_original = scaler.inverse_transform(sample_scaled)

print(f"\nScaled data: {np.round(sample_scaled[0], 6)}")
print(f"\nOriginal data: {sample_original[0]}")


Train data: 6 rows
Test data: 2 rows

THE RIGHT WAY:
     1. Fit the scaler on the training set
     2. Transform the training set
     3. Transform the test set (using the parameters learned from training)

X_train_scaled (sample):
       Umur  Tahun_Kerja  Rating_Kinerja
0 -0.602171    -0.667424        0.303908
1 -1.027233    -0.953463        0.101303
2 -1.027233    -0.953463       -0.033768

❌ THE WRONG WAY (do not copy this!)
   scaler.fit(pd.concat([X_train, X_test]))  # WRONG!
   → This causes DATA LEAKAGE!

Inverse transform
returns scaled data to its original values

Scaled data: [-0.602171 -0.667424  0.303908]

Original data: [24.   2.   3.5]


# Decision tree: choosing a scaling method

Is the data normally distributed?
- Yes: StandardScaler
- No: are there many outliers?
  - Yes: RobustScaler
  - No: MinMaxScaler or StandardScaler

Need a specific [0, 1] range?
- Yes: MinMaxScaler

Using a neural network?
- MinMaxScaler or StandardScaler

Remember:
- tree-based models (Decision Tree, Random Forest, XGBoost) do not need scaling
- always fit on the training set, then transform both train and test
- save the fitted scaler object for production
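
Saving the scaler could look like the sketch below. It assumes `joblib` (installed alongside scikit-learn) as the persistence tool; the file name `scaler.joblib` and the small training array are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training features: [age, years of work]
X_train = np.array([[24.0, 2.0], [27.0, 5.0], [22.0, 1.0], [30.0, 8.0]])
scaler = StandardScaler().fit(X_train)

# Save the fitted scaler, then load it back as a production service would
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")  # hypothetical file name
    joblib.dump(scaler, path)
    loaded_scaler = joblib.load(path)

# New data must be transformed with the training-time mean and std
X_new = np.array([[26.0, 3.0]])
same = np.allclose(scaler.transform(X_new), loaded_scaler.transform(X_new))
print("Loaded scaler reproduces the original transform:", same)
```

Loading the saved object, rather than fitting a new scaler on incoming data, keeps production inputs on the same scale the model was trained on.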