# Price Per Unit - Rescaling Methods Comparison

## Overview

This notebook implements and compares three rescaling methods for the **Price Per Unit** attribute:
1. **Min-Max Normalization**
2. **Z-Score Standardization**
3. **Robust Scaling**
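
As a quick reference, the three methods listed above can be written out by hand. This is a minimal sketch on a toy array: the price values are hypothetical and not taken from the dataset, and the population standard deviation (`ddof=0`) is assumed for the z-score, which matches what `StandardScaler` uses.

```python
import numpy as np

# Toy prices (hypothetical values, not taken from the dataset)
prices = np.array([5.0, 10.0, 20.0, 40.0, 100.0])

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1]
min_max = (prices - prices.min()) / (prices.max() - prices.min())

# Z-score standardization: (x - mean) / std, with the population std (ddof=0)
z_score = (prices - prices.mean()) / prices.std()

# Robust scaling: (x - median) / IQR, where IQR = Q3 - Q1
q1, q3 = np.percentile(prices, [25, 75])
robust = (prices - np.median(prices)) / (q3 - q1)

print("Min-max:", min_max)
print("Z-score:", z_score)
print("Robust: ", robust)
```

Only min-max guarantees a bounded output; the other two centre the data (on the mean or the median) and can produce negative values.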

## Source Dataset

**Input:** `handle_missing_data/output_data/4_discount_applied/final_cleaned_dataset.csv`
- 11,971 rows (after missing data handling)
- Price Per Unit reconstructed from Total Spent ÷ Quantity
- All missing values resolved
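
The reconstruction itself happens in the upstream missing-data notebook, so the following is only an illustrative sketch of the idea, not that notebook's code. The toy rows are hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Toy transactions (hypothetical values); column names follow the dataset
toy = pd.DataFrame({
    "Price Per Unit": [11.0, None, 41.0],
    "Quantity": [5.0, 5.0, 3.0],
    "Total Spent": [55.0, 32.5, 123.0],
})

# Fill missing unit prices from Total Spent / Quantity
reconstructed = toy["Total Spent"] / toy["Quantity"]
toy["Price Per Unit"] = toy["Price Per Unit"].fillna(reconstructed)

# Consistency check: price * quantity should reproduce the total
difference = (toy["Price Per Unit"] * toy["Quantity"] - toy["Total Spent"]).abs()
consistent = difference < 1e-9

print(toy)
print("All rows consistent:", consistent.all())
```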

## Price Per Unit Characteristics

- **Type:** Continuous currency data
- **Range:** Typically $1-$100
- **Distribution:** Varies by item category
- **Outliers:** Present but handled during reconstruction
- **Recommendation:** Min-Max Normalization (natural bounds, interpretable)

---

## Step 1: Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Configuration
INPUT_CSV = Path('../../output/1_handle_missing_data/final_cleaned_dataset.csv')
OUTPUT_DIR = Path('../../output/3_handle_rescale_data')
PRICE_PER_UNIT_COLUMN = 'Price Per Unit'

# Create output directory
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Input file: {INPUT_CSV}")
print(f"Output directory: {OUTPUT_DIR}")

In [2]:
# Load the cleaned dataset
df = pd.read_csv(INPUT_CSV)

df.head()

| | Transaction ID | Customer ID | Category | Item | Price Per Unit | Quantity | Total Spent | Payment Method | Location | Transaction Date | Discount Applied |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TXN_1002182 | CUST_01 | Food | Item_5_FOOD | 11.0 | 5.0 | 55.0 | Digital Wallet | In-store | 2024-10-08 | True |
| 1 | TXN_1003865 | CUST_15 | Furniture | Item_2_FUR | 6.5 | 5.0 | 32.5 | Cash | Online | 2022-03-12 | False |
| 2 | TXN_1003940 | CUST_06 | Furniture | Item_5_FUR | 11.0 | 9.0 | 99.0 | Digital Wallet | Online | 2022-04-22 | False |
| 3 | TXN_1004091 | CUST_04 | Food | Item_25_FOOD | 41.0 | 3.0 | 123.0 | Cash | In-store | 2023-11-09 | False |
| 4 | TXN_1004124 | CUST_08 | Computers and electric accessories | Item_7_CEA | 14.0 | 5.0 | 70.0 | Credit Card | In-store | 2022-03-02 | Unknown |


## Step 2: Exploratory Data Analysis - Price Per Unit

In [None]:
# Basic statistics
print("PRICE PER UNIT - Descriptive Statistics")
print(f"\n{df[PRICE_PER_UNIT_COLUMN].describe()}")

# Additional statistics
print(f"\nMissing values: {df[PRICE_PER_UNIT_COLUMN].isna().sum()}")
print(f"Unique values: {df[PRICE_PER_UNIT_COLUMN].nunique()}")
print(f"\nPrice range: ${df[PRICE_PER_UNIT_COLUMN].min():.2f} - ${df[PRICE_PER_UNIT_COLUMN].max():.2f}")

In [None]:
# Outlier detection using IQR method
Q1 = df[PRICE_PER_UNIT_COLUMN].quantile(0.25)
Q3 = df[PRICE_PER_UNIT_COLUMN].quantile(0.75)
IQR = Q3 - Q1

lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

outliers = df[(df[PRICE_PER_UNIT_COLUMN] < lower_fence) | (df[PRICE_PER_UNIT_COLUMN] > upper_fence)]

print("OUTLIER DETECTION (IQR Method)")
print(f"Q1 (25th percentile): ${Q1:.2f}")
print(f"Q3 (75th percentile): ${Q3:.2f}")
print(f"IQR: ${IQR:.2f}")
print(f"Lower fence: ${lower_fence:.2f}")
print(f"Upper fence: ${upper_fence:.2f}")
print(f"\nNumber of outliers: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")

if len(outliers) > 0:
    print(f"Outlier price range: ${outliers[PRICE_PER_UNIT_COLUMN].min():.2f} to ${outliers[PRICE_PER_UNIT_COLUMN].max():.2f}")

In [None]:
# Visualize original distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram
axes[0].hist(df[PRICE_PER_UNIT_COLUMN], bins=40, color='lightblue', edgecolor='black')
axes[0].set_title('Price Per Unit - Histogram', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Price ($)')
axes[0].set_ylabel('Frequency')
axes[0].axvline(df[PRICE_PER_UNIT_COLUMN].mean(), color='red', linestyle='--', 
                label=f'Mean: ${df[PRICE_PER_UNIT_COLUMN].mean():.2f}')
axes[0].axvline(df[PRICE_PER_UNIT_COLUMN].median(), color='green', linestyle='--', 
                label=f'Median: ${df[PRICE_PER_UNIT_COLUMN].median():.2f}')
axes[0].legend()

# Box plot
axes[1].boxplot(df[PRICE_PER_UNIT_COLUMN], vert=True)
axes[1].set_title('Price Per Unit - Box Plot', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Price ($)')
axes[1].axhline(upper_fence, color='red', linestyle='--', label=f'Upper Fence: ${upper_fence:.2f}')
axes[1].axhline(lower_fence, color='red', linestyle='--', label=f'Lower Fence: ${lower_fence:.2f}')
axes[1].legend()

# Distribution plot
df[PRICE_PER_UNIT_COLUMN].plot(kind='kde', ax=axes[2], color='darkblue')
axes[2].set_title('Price Per Unit - Density Plot', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Price ($)')
axes[2].set_ylabel('Density')

plt.tight_layout()
plt.show()

print("\nDistribution Analysis:")
skewness = df[PRICE_PER_UNIT_COLUMN].skew()
print(f"Skewness: {skewness:.4f}")
if abs(skewness) < 0.5:
    print("Approximately symmetric distribution")
elif skewness > 0:
    print("Right-skewed distribution (some high-priced items)")
else:
    print("Left-skewed distribution")

## Step 3: Apply Rescaling Methods

We will apply all three methods and compare them.

### Method 1: Min-Max Normalization (RECOMMENDED)

In [None]:
# Min-Max Normalization
min_max_scaler = MinMaxScaler()
df['PricePerUnit_Normalized'] = min_max_scaler.fit_transform(df[[PRICE_PER_UNIT_COLUMN]])

print("MIN-MAX NORMALIZATION")
print(df['PricePerUnit_Normalized'].describe())

# Show examples
print("\nExample transformations:")
print(df[[PRICE_PER_UNIT_COLUMN, 'PricePerUnit_Normalized']].head(10))

### Method 2: Z-Score Standardization

In [None]:
# Z-Score Standardization
standard_scaler = StandardScaler()
df['PricePerUnit_Standardized'] = standard_scaler.fit_transform(df[[PRICE_PER_UNIT_COLUMN]])

print("Z-SCORE STANDARDIZATION")
print(df['PricePerUnit_Standardized'].describe())

# Show examples
print("\nExample transformations:")
print(df[[PRICE_PER_UNIT_COLUMN, 'PricePerUnit_Standardized']].head(10))

### Method 3: Robust Scaling

In [None]:
# Robust Scaling
robust_scaler = RobustScaler()
df['PricePerUnit_Robust'] = robust_scaler.fit_transform(df[[PRICE_PER_UNIT_COLUMN]])

print("ROBUST SCALING")
print(df['PricePerUnit_Robust'].describe())

# Show examples
print("\nExample transformations:")
print(df[[PRICE_PER_UNIT_COLUMN, 'PricePerUnit_Robust']].head(10))

## Step 4: Compare All Methods

In [None]:
# Comparison table
comparison = pd.DataFrame({
    'Method': ['Original', 'Normalization (RECOMMENDED)', 'Standardization', 'Robust Scaling'],
    'Column': [PRICE_PER_UNIT_COLUMN, 'PricePerUnit_Normalized', 'PricePerUnit_Standardized', 'PricePerUnit_Robust'],
    'Min': [df[PRICE_PER_UNIT_COLUMN].min(), df['PricePerUnit_Normalized'].min(), 
            df['PricePerUnit_Standardized'].min(), df['PricePerUnit_Robust'].min()],
    'Max': [df[PRICE_PER_UNIT_COLUMN].max(), df['PricePerUnit_Normalized'].max(), 
            df['PricePerUnit_Standardized'].max(), df['PricePerUnit_Robust'].max()],
    'Mean': [df[PRICE_PER_UNIT_COLUMN].mean(), df['PricePerUnit_Normalized'].mean(), 
             df['PricePerUnit_Standardized'].mean(), df['PricePerUnit_Robust'].mean()],
    'Median': [df[PRICE_PER_UNIT_COLUMN].median(), df['PricePerUnit_Normalized'].median(), 
               df['PricePerUnit_Standardized'].median(), df['PricePerUnit_Robust'].median()],
    'Std': [df[PRICE_PER_UNIT_COLUMN].std(), df['PricePerUnit_Normalized'].std(), 
            df['PricePerUnit_Standardized'].std(), df['PricePerUnit_Robust'].std()]
})

print("COMPARISON OF ALL METHODS")
print(comparison.to_string(index=False))

In [None]:
# Visualize all methods side by side
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Original
axes[0, 0].hist(df[PRICE_PER_UNIT_COLUMN], bins=40, color='lightblue', edgecolor='black')
axes[0, 0].set_title('Original Price Per Unit', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Price ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].axvline(df[PRICE_PER_UNIT_COLUMN].mean(), color='red', linestyle='--', 
                   label=f'Mean: ${df[PRICE_PER_UNIT_COLUMN].mean():.2f}')
axes[0, 0].legend()

# Normalized
axes[0, 1].hist(df['PricePerUnit_Normalized'], bins=40, color='lightgreen', edgecolor='black')
axes[0, 1].set_title('Min-Max Normalized [0, 1] ⭐ RECOMMENDED', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Normalized Value')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].axvline(df['PricePerUnit_Normalized'].mean(), color='red', linestyle='--', 
                   label=f'Mean: {df["PricePerUnit_Normalized"].mean():.2f}')
axes[0, 1].legend()

# Standardized
axes[1, 0].hist(df['PricePerUnit_Standardized'], bins=40, color='lightcoral', edgecolor='black')
axes[1, 0].set_title('Z-Score Standardized (μ=0, σ=1)', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Standardized Value')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].axvline(0, color='green', linestyle='--', label='Mean: 0')
axes[1, 0].legend()

# Robust Scaled
axes[1, 1].hist(df['PricePerUnit_Robust'], bins=40, color='lavender', edgecolor='black')
axes[1, 1].set_title('Robust Scaled (median=0, IQR=1)', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Robust Scaled Value')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].axvline(0, color='green', linestyle='--', label='Median: 0')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

## Step 5: Interpretability Analysis

In [None]:
# Analyze interpretability for Price Per Unit
print("INTERPRETABILITY ANALYSIS")

# Sample prices and their scaled values
sample_prices = [df[PRICE_PER_UNIT_COLUMN].min(), 
                 df[PRICE_PER_UNIT_COLUMN].quantile(0.25),
                 df[PRICE_PER_UNIT_COLUMN].median(),
                 df[PRICE_PER_UNIT_COLUMN].quantile(0.75),
                 df[PRICE_PER_UNIT_COLUMN].max()]

# Compute the bounds once instead of on every loop iteration
price_min = df[PRICE_PER_UNIT_COLUMN].min()
price_max = df[PRICE_PER_UNIT_COLUMN].max()

print("Price -> Normalized Value Mapping:")
for price in sample_prices:
    normalized = (price - price_min) / (price_max - price_min)
    print(f"${price:.2f} -> {normalized:.4f}")
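
Part of what makes min-max values interpretable is that the mapping is easy to reverse. The sketch below, on hypothetical toy prices, shows how a fitted `MinMaxScaler` can scale a new price with `transform` and recover the original dollar values with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy prices as a single-feature column (hypothetical values)
prices = np.array([[1.0], [11.0], [21.0], [41.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(prices)

# Scale a new price using the min and max learned during fit
new_scaled = scaler.transform(np.array([[21.0]]))

# Map scaled values back to dollars
recovered = scaler.inverse_transform(scaled)

print("Scaled:   ", scaled.ravel())
print("New price:", new_scaled.ravel())
print("Recovered:", recovered.ravel())
```

Keeping the fitted scaler around (rather than recomputing min and max by hand) ensures that new data is scaled with the same bounds as the training data.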



## Step 6: Save All Rescaled Datasets

In [None]:
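# Save one file per rescaling method, as per assignment requirements.
# Each file keeps the original columns plus only that method's rescaled column;
# copying df unchanged would write three identical files.
SCALED_COLUMNS = ['PricePerUnit_Normalized', 'PricePerUnit_Standardized', 'PricePerUnit_Robust']

# 1. Normalization
df_norm = df.drop(columns=[c for c in SCALED_COLUMNS if c != 'PricePerUnit_Normalized'])
output_norm = OUTPUT_DIR / 'data_rescaling_norm_price_per_unit.csv'
df_norm.to_csv(output_norm, index=False)

# 2. Standardization
df_std = df.drop(columns=[c for c in SCALED_COLUMNS if c != 'PricePerUnit_Standardized'])
output_std = OUTPUT_DIR / 'data_rescaling_std_price_per_unit.csv'
df_std.to_csv(output_std, index=False)

# 3. Robust Scaling
df_robust = df.drop(columns=[c for c in SCALED_COLUMNS if c != 'PricePerUnit_Robust'])
output_robust = OUTPUT_DIR / 'data_rescaling_robust_price_per_unit.csv'
df_robust.to_csv(output_robust, index=False)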


## Step 7: Validation

In [None]:
# Validation checks
print("VALIDATION CHECKS")

print("\n1. RANGE VALIDATION:")
norm_in_range = (df['PricePerUnit_Normalized'] >= 0).all() and (df['PricePerUnit_Normalized'] <= 1).all()
print(f"Normalization in [0, 1]: {norm_in_range}")

# Standardized and robust-scaled values are unbounded, so only report their ranges
print(f"Standardization range: [{df['PricePerUnit_Standardized'].min():.2f}, {df['PricePerUnit_Standardized'].max():.2f}]")

print(f"Robust scaling range: [{df['PricePerUnit_Robust'].min():.2f}, {df['PricePerUnit_Robust'].max():.2f}]")
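
Beyond ranges, the centre and spread of each scaled column can also be checked. The following is a sketch on hypothetical toy prices rather than the notebook's data. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample standard deviation (`ddof=1`), so `describe()` on a standardized column reports a value slightly above 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, RobustScaler

# Toy prices (hypothetical values); the column name follows the dataset
prices = pd.DataFrame({"Price Per Unit": [5.0, 10.0, 20.0, 40.0, 100.0]})

z = StandardScaler().fit_transform(prices).ravel()
r = RobustScaler().fit_transform(prices).ravel()

# Standardization: mean 0 and population std 1
mean_ok = np.isclose(z.mean(), 0.0)
std_ok = np.isclose(z.std(ddof=0), 1.0)
sample_std = pd.Series(z).std()  # pandas default is ddof=1

# Robust scaling: median 0 and IQR 1
q1, q3 = np.percentile(r, [25, 75])
median_ok = np.isclose(np.median(r), 0.0)
iqr_ok = np.isclose(q3 - q1, 1.0)

print(f"Standardized mean == 0: {mean_ok}, population std == 1: {std_ok}")
print(f"pandas .std() on standardized values: {sample_std:.4f}")
print(f"Robust median == 0: {median_ok}, IQR == 1: {iqr_ok}")
```

On a dataset with thousands of rows the gap between the two standard deviations is tiny, but it explains why the `Std` column in a comparison table may not read exactly 1.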


## Summary

### Methods Applied:
1. **Min-Max Normalization** -> Range [0, 1]
2. **Z-Score Standardization** -> Mean=0, Std=1
3. **Robust Scaling** -> Median=0, IQR=1

### Recommendation:
**Min-Max Normalization** is recommended for Price Per Unit because:
- Natural bounds (prices > $0 with practical upper limit)
- Outliers already handled during reconstruction (Total Spent ÷ Quantity)
- Bounded [0, 1] range is interpretable and ML-friendly
- Preserves price relationships and semantics
- No negative values (semantically correct for prices)
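
The recommendation depends on extreme values having been dealt with, because min-max scaling is sensitive to them. As an illustration only, the sketch below adds one hypothetical extreme price to a toy array and compares how min-max and robust scaling treat the remaining values. The helper functions `min_max` and `robust` are defined here for the example and are not part of the notebook.

```python
import numpy as np

def min_max(x):
    """Scale values to [0, 1] using the observed min and max."""
    return (x - x.min()) / (x.max() - x.min())

def robust(x):
    """Centre on the median and divide by the interquartile range."""
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

# Toy prices (hypothetical), then the same prices plus one extreme value
typical = np.array([5.0, 10.0, 20.0, 30.0, 40.0])
with_outlier = np.append(typical, 1000.0)

print("Min-max, no outlier:  ", min_max(typical))
print("Min-max, with outlier:", min_max(with_outlier)[:-1])
print("Robust, with outlier: ", robust(with_outlier)[:-1])
```

With the extreme value present, min-max squeezes the typical prices into a narrow band near 0, while robust scaling keeps them spread out. If the IQR check in Step 2 reports many outliers, robust scaling may be worth reconsidering.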

### Output Files:
- `data_rescaling_norm_price_per_unit.csv` - Min-Max Normalization
- `data_rescaling_std_price_per_unit.csv` - Z-Score Standardization
- `data_rescaling_robust_price_per_unit.csv` - Robust Scaling