# Data Normalization & Transformation - NIM 434231048

**Course**: Data Mining  
**Data File Name**: `shopping_data.csv`  
**Author**: Surya Dwi Satria (NIM 434231048)

This notebook performs:
1. Data import and exploration  
2. Preprocessing: handling missing values, duplicates, and outliers, plus categorical encoding  
3. Normalization/standardization: Simple Feature Scaling, Min-Max, and Z-Score  
4. Visualization and comparison before vs. after scaling

> Note: The dataset is synthetic, created for this lab assignment.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", None)
data_path = "/mnt/data/shopping_data.csv"
df = pd.read_csv(data_path)
df.head()

In [None]:
df.info()

In [None]:
df.describe(include='all').T

## Check Missing Values & Duplicates

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum()

### Handling Duplicates

In [None]:
df = df.drop_duplicates().reset_index(drop=True)
df.duplicated().sum()

### Imputing Missing Values (Numeric: median, Categorical: mode)

In [None]:
# Numeric columns of any width (int32, float32, ...), not only float64/int64
num_cols = df.select_dtypes(include='number').columns.tolist()
cat_cols = df.select_dtypes(include=['object']).columns.tolist()

for c in num_cols:
    df[c] = df[c].fillna(df[c].median())

for c in cat_cols:
    df[c] = df[c].fillna(df[c].mode().iloc[0])

df.isna().sum()
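
As a quick illustration of the imputation rule above, the sketch below applies it to a small hypothetical frame. The column names `Income` and `City` and all values are made up for this example and are not taken from `shopping_data.csv`. It also shows why the median is a reasonable choice for numeric columns: unlike the mean, it is not pulled toward a single extreme value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from shopping_data.csv
toy = pd.DataFrame({
    "Income": [30.0, 40.0, np.nan, 50.0, 1000.0],
    "City": ["A", "B", "A", None, "A"],
})

# The mean is dragged upward by the extreme value 1000; the median is not
income_mean = toy["Income"].mean()      # 280.0
income_median = toy["Income"].median()  # 45.0

# Numeric column: fill missing entries with the median of the observed values
toy["Income"] = toy["Income"].fillna(income_median)

# Categorical column: fill with the most frequent value (first mode on ties)
city_mode = toy["City"].mode().iloc[0]
toy["City"] = toy["City"].fillna(city_mode)

print(toy)
```

With these values the missing `Income` is filled with 45.0 rather than 280.0, and the missing `City` is filled with `"A"`.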

### Handling Outliers (IQR Capping)

In [None]:
def iqr_cap(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series.clip(lower, upper)

num_cols_wo_id = [c for c in num_cols if c != "CustomerID"]
df[num_cols_wo_id] = df[num_cols_wo_id].apply(iqr_cap)

df.describe().T
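
To make the capping rule concrete, here is a small worked example on a hypothetical series (the values are invented for illustration). It restates `iqr_cap` so the block runs on its own and relies on pandas' default linear interpolation for quantiles.

```python
import pandas as pd

def iqr_cap(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(q1 - k * iqr, q3 + k * iqr)

# Hypothetical values with one obvious outlier (100)
s = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 100.0])

# Q1 = 12.25, Q3 = 14.75, IQR = 2.5
# lower = 12.25 - 1.5 * 2.5 = 8.5, upper = 14.75 + 1.5 * 2.5 = 18.5
capped = iqr_cap(s)
print(capped.tolist())
```

Only the outlier changes: 100 is pulled down to the upper bound 18.5, while the row itself is kept. This is the difference between capping and dropping outliers.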

### Categorical Encoding (One-Hot)

In [None]:
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)
df_encoded.head()
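
The effect of `drop_first=True` is easiest to see on a tiny frame. The sketch below uses a hypothetical `Gender` column (an assumed name, not confirmed to exist in `shopping_data.csv`): with two categories, only one indicator column is kept, because the dropped category is implied when the indicator is false.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [20, 30, 40],
})

# Categories are sorted, so "Female" comes first and is the one dropped
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True)

print(list(encoded.columns))
print(encoded["Gender_Male"].astype(int).tolist())
```

Dropping one level avoids perfectly collinear indicator columns, which matters mainly for linear models. For purely distance-based methods, keeping all levels is also a defensible choice.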

In [None]:
clean_path = "/mnt/data/shopping_data_clean.csv"
df.to_csv(clean_path, index=False)
encoded_path = "/mnt/data/shopping_data_encoded.csv"
df_encoded.to_csv(encoded_path, index=False)
(clean_path, encoded_path)

## Distribution Visualization (Before Scaling)

In [None]:
plot_cols = ['Age','AnnualIncome','SpendingScore','TenureYear','NumTransactions','AvgBasketValue']
for c in plot_cols:
    plt.figure()
    df[c].hist(bins=30)
    plt.title(f"Distribution of {c} (before scaling)")
    plt.xlabel(c)
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.savefig(f"/mnt/data/before_{c}.png")
    plt.show()

## Normalization / Standardization

### 1) Simple Feature Scaling (x / max(x))

In [None]:
def simple_feature_scaling(x: pd.Series):
    m = x.max()
    return x / m if m != 0 else x

sfs = df_encoded.copy()
for c in plot_cols:
    sfs[c] = simple_feature_scaling(sfs[c])

sfs.head()
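
Simple feature scaling only lands in 0..1 when the values are non-negative. The sketch below shows this on a hypothetical series containing a negative value. It also shows one possible alternative, dividing by the largest absolute value, which is an assumption of this example and is not used in the notebook.

```python
import pandas as pd

# Hypothetical values including a negative number
x = pd.Series([-10.0, 0.0, 2.0, 8.0])

# x / max(x): the maximum maps to 1, but -10 falls below 0
scaled_by_max = x / x.max()

# Alternative (not used in this notebook): x / max(|x|) stays within -1..1
scaled_by_abs_max = x / x.abs().max()

print(scaled_by_max.tolist())
print(scaled_by_abs_max.tolist())
```

Here `x / max(x)` gives -1.25 for the first value, so the result is not bounded by 0 and 1, whereas the absolute-maximum variant is bounded by -1 and 1.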

### 2) Min-Max Scaling (0..1)

In [None]:
def minmax_scaling(x: pd.Series):
    mn, mx = x.min(), x.max()
    return (x - mn) / (mx - mn) if mx != mn else x

mm = df_encoded.copy()
for c in plot_cols:
    mm[c] = minmax_scaling(mm[c])

mm.head()
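
A short worked example of the min-max formula, using invented values. It restates `minmax_scaling` so the block is self-contained, and it highlights one edge case of that function: a constant column is returned unchanged, so it is not guaranteed to lie in 0..1.

```python
import pandas as pd

def minmax_scaling(x: pd.Series):
    mn, mx = x.min(), x.max()
    # A constant column would divide by zero, so it is returned as-is
    return (x - mn) / (mx - mn) if mx != mn else x

# Hypothetical values: min = 20, max = 60, so the range is 40
x = pd.Series([20.0, 30.0, 40.0, 60.0])
scaled = minmax_scaling(x)          # (x - 20) / 40

# Edge case: every value equals 5, so the column is left untouched
constant = pd.Series([5.0, 5.0, 5.0])
constant_scaled = minmax_scaling(constant)

print(scaled.tolist())
print(constant_scaled.tolist())
```

If a strict 0..1 range is required downstream, constant columns could be mapped to 0 or dropped instead. Which of these is appropriate depends on the model and is left open here.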

### 3) Z-Score Standardization

In [None]:
def zscore(x: pd.Series):
    mu, sd = x.mean(), x.std(ddof=0)
    return (x - mu) / sd if sd != 0 else x

zs = df_encoded.copy()
for c in plot_cols:
    zs[c] = zscore(zs[c])

zs.head()
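
The `ddof=0` argument in `zscore` matters: pandas' `Series.std()` defaults to `ddof=1` (sample standard deviation), while the function above divides by the population standard deviation. The sketch below checks this on a hypothetical series chosen so that the numbers come out round.

```python
import pandas as pd

def zscore(x: pd.Series):
    mu, sd = x.mean(), x.std(ddof=0)
    return (x - mu) / sd if sd != 0 else x

# Hypothetical values: mean = 5, population standard deviation = 2
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = zscore(x)

print(z.tolist())
print(z.mean())        # mean of the standardized values
print(z.std(ddof=0))   # population std, the one that equals 1
print(z.std())         # sample std (ddof=1), slightly above 1
```

The standardized column has a standard deviation of exactly 1 only under the same `ddof` that was used for scaling. Checking it with the pandas default gives a value slightly above 1, which is expected and not a bug.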

## Distribution Visualization (After Scaling)

In [None]:
for c in plot_cols:
    for name, frame in [("sfs", sfs), ("mm", mm), ("zs", zs)]:
        plt.figure()
        frame[c].hist(bins=30)
        plt.title(f"Distribution of {c} ({name})")
        plt.xlabel(c)
        plt.ylabel("Frequency")
        plt.tight_layout()
        plt.savefig(f"/mnt/data/after_{name}_{c}.png")
        plt.show()

## Value Comparison (Head)

In [None]:
compare = pd.DataFrame({
    'Original_Age': df['Age'].head(10).values,
    'SFS_Age': sfs['Age'].head(10).values,
    'MM_Age': mm['Age'].head(10).values,
    'ZS_Age': zs['Age'].head(10).values
})
compare

In [None]:
sfs_path = "/mnt/data/shopping_data_scaled_sfs.csv"
mm_path = "/mnt/data/shopping_data_scaled_minmax.csv"
zs_path = "/mnt/data/shopping_data_scaled_zscore.csv"
sfs.to_csv(sfs_path, index=False)
mm.to_csv(mm_path, index=False)
zs.to_csv(zs_path, index=False)
(sfs_path, mm_path, zs_path)

## Brief Conclusions

- **Simple Feature Scaling** divides each feature by its maximum, so values are expressed relative to that maximum. It yields the 0..1 range only for non-negative data; negative values (none in this dataset) would map below 0.  
- **Min-Max** maps values to the **0..1** range, which suits distance-based algorithms.  
- **Z-Score** rescales each feature to mean 0 and standard deviation 1. It is commonly used with algorithms that assume roughly normally distributed features, although it does not itself change the shape of a distribution.  
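
All three methods are positive linear rescalings, so they move and stretch a feature without changing the shape of its histogram. The sketch below checks this on a hypothetical right-skewed series by comparing skewness before and after each scaling. The values are invented, and skewness is used here only as a convenient shape summary.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive values
x = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0])

sfs = x / x.max()                          # simple feature scaling
mm = (x - x.min()) / (x.max() - x.min())   # min-max scaling
zs = (x - x.mean()) / x.std(ddof=0)        # z-score standardization

skews = {
    "original": x.skew(),
    "sfs": sfs.skew(),
    "mm": mm.skew(),
    "zs": zs.skew(),
}
print(skews)

# Skewness is unchanged by shifting and positive rescaling
same_shape = all(np.isclose(v, skews["original"]) for v in skews.values())
print(same_shape)
```

A skewed feature therefore stays skewed after any of these scalings. If a model needs a more symmetric distribution, a non-linear transform would be required, which is outside the scope of this notebook.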

Files produced:
- `/mnt/data/shopping_data_clean.csv`
- `/mnt/data/shopping_data_encoded.csv`
- `/mnt/data/shopping_data_scaled_sfs.csv`
- `/mnt/data/shopping_data_scaled_minmax.csv`
- `/mnt/data/shopping_data_scaled_zscore.csv`
