# Discount Applied - One-Hot Encoding


## Why One-Hot Encoding for Discount Applied?

**Discount Applied has 3 distinct categories:** `True`, `False`, `Unknown`

### Method Comparison:

| Alternative Method | Why NOT Suitable |
|-------------------|------------------|
| **Label Encoding** | Imposes a false ordering: False(0) < True(1) < Unknown(2) has no meaning |
| **Binary (0/1)** | A single binary column cannot represent 3 categories |
| **Target Encoding** | Unnecessarily complex for a 3-category column; discount is already related to Total Spent |
| **Ordinal Encoding** | No natural order exists between True, False, and Unknown |
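
To make the ordering problem concrete, here is a minimal sketch on a toy column (illustrative data, not the project dataset). It assumes the categories are stored as strings and uses pandas category codes as a stand-in for label encoding; `dtype=int` is passed only so the one-hot result prints as 0/1.

```python
import pandas as pd

# Toy column with the three categories (illustrative data only)
s = pd.Series(["True", "False", "Unknown", "False"], name="Discount Applied")

# Label encoding via category codes: categories are sorted, then numbered,
# so the result carries an ordering False(0) < True(1) < Unknown(2)
codes = s.astype("category").cat.codes
print(codes.tolist())  # [1, 0, 2, 0]

# One-hot encoding: one indicator column per category, no ordering implied
one_hot = pd.get_dummies(s, prefix="Discount", dtype=int)
print(one_hot)
```

A model that reads the integer codes as magnitudes would treat Unknown as "more than" True, which is the false relationship one-hot encoding avoids.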

### Why One-Hot Encoding Works:
- Treats "Unknown" as a legitimate category rather than as missing data
- Introduces no false relationships between categories
- Preserves the missingness information (MCAR) identified in the missing data analysis
- Gives each state its own independent column
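
The first point can be illustrated with a small sketch, assuming toy data rather than the project dataset. By default `pd.get_dummies` ignores `NaN`, so a missing value becomes a row of all zeros, whereas an explicit "Unknown" label gets its own column.

```python
import numpy as np
import pandas as pd

# Toy data: the same three rows, with the gap stored two different ways
with_unknown = pd.Series(["True", "Unknown", "False"])
with_nan = pd.Series(["True", np.nan, "False"])

# dtype=int is used only so the toy output prints as 0/1
kept = pd.get_dummies(with_unknown, prefix="Discount", dtype=int)
dropped = pd.get_dummies(with_nan, prefix="Discount", dtype=int)

print(kept.columns.tolist())         # three columns, including Discount_Unknown
print(dropped.columns.tolist())      # only two columns
print(dropped.sum(axis=1).tolist())  # the NaN row sums to 0
```

Keeping "Unknown" as a label is what allows the row-sum validation later in this notebook to hold for every row.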

**Expected Output:**
- 3 new columns: `Discount_True`, `Discount_False`, `Discount_Unknown`
- Each row has exactly one `1` (`True`) and two `0`s (`False`)
- Distribution: roughly 33% per category (per the missing data analysis)
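
One way to check the expected distribution after encoding is to take the column means, since the mean of an indicator column equals the share of rows in that category. The sketch below uses a perfectly balanced toy column, which is an assumption for illustration and not the real data.

```python
import pandas as pd

# Balanced toy column (illustrative only): two rows per category
s = pd.Series(["True", "False", "Unknown"] * 2)
one_hot = pd.get_dummies(s, prefix="Discount", dtype=int)

# Mean of each indicator column = proportion of rows in that category
shares = one_hot.mean()
print(shares.round(3))

# The same proportions computed directly from the original column
print(s.value_counts(normalize=True).round(3))
```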

In [None]:
import pandas as pd
from pathlib import Path

In [None]:
# Input and output paths
CSV_IN = "../../output/1_handle_missing_data/final_cleaned_dataset.csv"
CSV_OUT = "../../output/2_handle_encoding_data/discount_applied_one_hot_encoded.csv"

DISCOUNT_APPLIED = "Discount Applied"

# Load the cleaned dataset
df = pd.read_csv(CSV_IN)
data = df.copy()

## One-Hot Encoding Implementation

**Encoding rule:** Create 3 binary columns: `Discount_True`, `Discount_False`, `Discount_Unknown`

In [3]:
# Apply one-hot encoding
discount_encoded = pd.get_dummies(data[DISCOUNT_APPLIED], prefix='Discount', drop_first=False)

# Add encoded columns to dataframe
data = pd.concat([data, discount_encoded], axis=1)

print("\nOriginal Discount Applied vs Encoded:")
print(data[[DISCOUNT_APPLIED, 'Discount_False', 'Discount_True', 'Discount_Unknown']].head(10))


Original Discount Applied vs Encoded:
  Discount Applied  Discount_False  Discount_True  Discount_Unknown
0             True           False           True             False
1            False            True          False             False
2            False            True          False             False
3            False            True          False             False
4          Unknown           False          False              True
5          Unknown           False          False              True
6            False            True          False             False
7          Unknown           False          False              True
8          Unknown           False          False              True
9             True           False           True             False
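
The encoded columns printed above are boolean (`True`/`False`) rather than integer 0/1. If integer columns are preferred downstream, one option is the `dtype` parameter of `pd.get_dummies`. A minimal sketch on toy data follows; the default dtype depends on the pandas version, so treat the comment about it as an assumption about the environment.

```python
import pandas as pd

# Toy column (illustrative only)
s = pd.Series(["True", "False", "Unknown"], name="Discount Applied")

# Default dtype: bool on recent pandas versions (assumed here), uint8 on older ones
as_default = pd.get_dummies(s, prefix="Discount")

# Explicit integer indicator columns
as_int = pd.get_dummies(s, prefix="Discount", dtype=int)

print(as_default.dtypes)
print(as_int)
```

Either representation passes the validation checks below, because `True` and `False` compare equal to 1 and 0.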


## Validation

Verify one-hot encoding correctness:

In [4]:
# 1. Check one-hot columns sum to 1 per row
one_hot_sum = data[['Discount_False', 'Discount_True', 'Discount_Unknown']].sum(axis=1)
sum_check = (one_hot_sum == 1).all()
print(f"1. Each row has exactly one '1' across encoded columns: {sum_check}")

# 2. Check no missing values
no_missing = data[['Discount_False', 'Discount_True', 'Discount_Unknown']].isna().sum().sum() == 0
print(f"2. No missing values in encoded columns: {no_missing}")

# 3. Check only binary values (0/1, or their boolean equivalents False/True)
binary_check = data[['Discount_False', 'Discount_True', 'Discount_Unknown']].isin([0, 1]).all().all()
print(f"3. Only contains 0/1 values: {binary_check}")


1. Each row has exactly one '1' across encoded columns: True
2. No missing values in encoded columns: True
3. Only contains 0/1 values: True
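
The three checks above confirm the shape of the encoding but not that each `1` sits in the right column. A possible extra check, sketched here on toy data, recovers the category from the encoded columns and compares it with the original. It assumes the `Discount_` prefix used in this notebook and string-valued categories.

```python
import pandas as pd

# Toy column (illustrative only)
s = pd.Series(["True", "False", "Unknown", "False"], name="Discount Applied")
one_hot = pd.get_dummies(s, prefix="Discount", dtype=int)

# For each row, take the name of the column holding the 1, then strip the prefix
recovered = one_hot.idxmax(axis=1).str.replace("Discount_", "", regex=False)

round_trip_ok = (recovered == s).all()
print(f"Encoded columns map back to the original values: {round_trip_ok}")
```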


## Save Output

Save the dataset with the encoded Discount Applied columns:

In [5]:
# Create output directory if it doesn't exist
Path(CSV_OUT).parent.mkdir(parents=True, exist_ok=True)

# Save the encoded dataset
data.to_csv(CSV_OUT, index=False)
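
As an optional sanity check on the saved file, the CSV can be read back and compared with the in-memory frame. The sketch below writes to a temporary directory with a hypothetical file name instead of `CSV_OUT`, and uses toy data, so it can run on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the encoded dataset (illustrative only)
df = pd.DataFrame({"Discount Applied": ["True", "False", "Unknown"]})
dummies = pd.get_dummies(df["Discount Applied"], prefix="Discount", dtype=int)
encoded = pd.concat([df, dummies], axis=1)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical nested path, created the same way as the output directory above
    out_path = Path(tmp) / "nested" / "encoded.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    encoded.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same columns and shape after the round trip through CSV
print(reloaded.columns.tolist() == encoded.columns.tolist())
print(reloaded.shape == encoded.shape)
```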