# Payment Method - One-Hot Encoding

## Encoding Strategy

**Method:** One-Hot Encoding

**Why One-Hot Encoding:**
- Only 3 categories (Cash, Credit Card, Digital Wallet) → manageable dimensionality
- No ordinal relationship between payment methods (they are independent choices)
- Standard industry best practice for low-cardinality nominal variables
- Prepares data for ML algorithms that require numerical input

**Why NOT other methods:**

| Alternative Method | Why NOT Suitable |
|-------------------|------------------|
| Label Encoding | Implies false ordering: Cash(0) < Credit(1) < Digital(2) |
| Target Encoding | Overcomplicates a simple 3-category variable; risks target leakage and overfitting |
| Frequency Encoding | Loses categorical identity; only shows popularity |
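| Binary Encoding | Packs category codes into bit columns (2 columns here instead of 3); saves almost nothing for 3 categories and the bit columns are hard to interpret |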
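
The false-ordering problem in the table can be seen on a few rows. The sketch below uses a small made-up series (`toy` is hypothetical, not the project dataset) and pandas category codes as a stand-in for label encoding; it is an illustration under those assumptions, not part of the pipeline.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.Series(
    ["Cash", "Credit Card", "Digital Wallet", "Cash"],
    name="Payment Method",
)

# Label encoding: categories are sorted and mapped to integers,
# which suggests an order (Cash < Credit Card < Digital Wallet) that does not exist
label_codes = toy.astype("category").cat.codes
print(label_codes.tolist())  # [0, 1, 2, 0]

# One-hot encoding: one independent indicator column per category, no implied order
one_hot = pd.get_dummies(toy, prefix="Payment")
print(one_hot.astype(int))
```

A distance-based or linear model would treat the integer codes as magnitudes, whereas the one-hot columns keep the three methods equidistant.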

**Expected Output:**
- 3 new columns: `Payment_Cash`, `Payment_Credit Card`, `Payment_Digital Wallet`
- Each row has exactly one active column (`True`, i.e. 1) and two inactive columns (`False`, i.e. 0)
- Most ML algorithms that expect numeric input can process these binary columns

In [1]:
import pandas as pd

# Load the cleaned dataset (the source for all encoding steps)
df = pd.read_csv('../../../handle_missing_data/output_data/4_discount_applied/final_cleaned_dataset.csv')
data = df.copy()

## Step 1: Analyze Payment Method Distribution

In [2]:
# Check unique payment methods and their distribution
print("Unique Payment Methods:")
print(data['Payment Method'].value_counts())
print("\nDistribution (%):")
print(data['Payment Method'].value_counts(normalize=True) * 100)
print(f"\nTotal unique values: {data['Payment Method'].nunique()}")
print(f"Missing values: {data['Payment Method'].isna().sum()}")

Unique Payment Methods:
Payment Method
Cash              4103
Digital Wallet    3941
Credit Card       3927
Name: count, dtype: int64

Distribution (%):
Payment Method
Cash              34.274497
Digital Wallet    32.921226
Credit Card       32.804277
Name: proportion, dtype: float64

Total unique values: 3
Missing values: 0


## Step 2: Apply One-Hot Encoding

In [3]:
# Apply one-hot encoding using pd.get_dummies
# drop_first=False to keep all 3 columns for interpretability
payment_encoded = pd.get_dummies(data['Payment Method'], prefix='Payment', drop_first=False)


print(payment_encoded.columns.tolist())
print("\nFirst few rows of encoded data:")
print(payment_encoded.head(10))

['Payment_Cash', 'Payment_Credit Card', 'Payment_Digital Wallet']

First few rows of encoded data:
   Payment_Cash  Payment_Credit Card  Payment_Digital Wallet
0         False                False                    True
1          True                False                   False
2         False                False                    True
3          True                False                   False
4         False                 True                   False
5          True                False                   False
6         False                 True                   False
7          True                False                   False
8         False                False                    True
9          True                False                   False
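
The step above keeps all three columns and returns booleans. For models that are sensitive to perfectly collinear inputs (for example, unregularized linear regression), a common variation is to drop one column and cast to integers. The following is a minimal sketch on made-up rows (`toy` is hypothetical); it shows the `drop_first` and `dtype` parameters of `pd.get_dummies` and is not what this notebook does.

```python
import pandas as pd

# Hypothetical toy rows, not the project dataset
toy = pd.DataFrame({"Payment Method": ["Digital Wallet", "Cash", "Credit Card", "Cash"]})

# Same encoding as in this notebook, but with 0/1 integers instead of booleans
full = pd.get_dummies(toy["Payment Method"], prefix="Payment", dtype=int)

# Variation: drop the first category ("Cash"), which then becomes the baseline
reduced = pd.get_dummies(toy["Payment Method"], prefix="Payment", drop_first=True, dtype=int)

print(full.columns.tolist())     # ['Payment_Cash', 'Payment_Credit Card', 'Payment_Digital Wallet']
print(reduced.columns.tolist())  # ['Payment_Credit Card', 'Payment_Digital Wallet']
```

In the reduced form a Cash row is represented by zeros in both remaining columns, which is less readable but removes the redundancy.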


## Step 3: Validation - Verify One-Hot Encoding Properties

In [4]:
# Validation 1: Each row should sum to exactly 1 (one payment method per transaction)
row_sums = payment_encoded.sum(axis=1)
print("Row sum validation:")
print(f"All rows sum to 1: {(row_sums == 1).all()}")
print(f"Min sum: {row_sums.min()}, Max sum: {row_sums.max()}")

# Validation 2: Only binary values (True/False, equivalent to 1/0)
print("\nBinary validation:")
for col in payment_encoded.columns:
    unique_vals = payment_encoded[col].unique()
    print(f"{col}: {sorted(unique_vals)}")

# Validation 3: No missing values
print(f"\nMissing values: {payment_encoded.isna().sum().sum()}")

# Validation 4: Total counts match original distribution
print("\nDistribution check:")
for col in payment_encoded.columns:
    original_method = col.replace('Payment_', '')
    encoded_count = payment_encoded[col].sum()
    original_count = (data['Payment Method'] == original_method).sum()
    print(f"{original_method}: Encoded={encoded_count}, Original={original_count}, Match={encoded_count == original_count}")

Row sum validation:
All rows sum to 1: True
Min sum: 1, Max sum: 1

Binary validation:
Payment_Cash: [np.False_, np.True_]
Payment_Credit Card: [np.False_, np.True_]
Payment_Digital Wallet: [np.False_, np.True_]

Missing values: 0

Distribution check:
Cash: Encoded=4103, Original=4103, Match=True
Credit Card: Encoded=3927, Original=3927, Match=True
Digital Wallet: Encoded=3941, Original=3941, Match=True
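
The printed checks above have to be read by eye. If the same checks should fail loudly, they could be wrapped in assertions. The helper below, `validate_one_hot`, is a hypothetical name introduced for this sketch, and it is exercised on made-up rows rather than the project dataset.

```python
import pandas as pd


def validate_one_hot(encoded: pd.DataFrame, original: pd.Series, prefix: str = "Payment_") -> None:
    """Raise AssertionError if the encoded frame is not a valid one-hot encoding."""
    # Exactly one active column per row
    assert (encoded.sum(axis=1) == 1).all(), "some rows do not sum to 1"
    # No missing values
    assert not encoded.isna().any().any(), "missing values found"
    # Per-category counts match the original column
    expected = original.value_counts()
    for col in encoded.columns:
        category = col.removeprefix(prefix)
        assert encoded[col].sum() == expected.get(category, 0), f"count mismatch for {category}"


# Hypothetical toy data
toy = pd.Series(["Cash", "Digital Wallet", "Cash", "Credit Card"], name="Payment Method")
validate_one_hot(pd.get_dummies(toy, prefix="Payment"), toy)
print("validation passed")
```

Note that `str.removeprefix` requires Python 3.9 or later.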


## Step 4: Combine with Original Dataset

In [5]:
# Combine encoded columns with original dataset
data_encoded = pd.concat([data, payment_encoded], axis=1)

print("\nColumns added:")
print([col for col in data_encoded.columns if col.startswith('Payment_')])
print(data_encoded[['Payment Method', 'Payment_Cash', 'Payment_Credit Card', 'Payment_Digital Wallet']].head(10))


Columns added:
['Payment_Cash', 'Payment_Credit Card', 'Payment_Digital Wallet']
   Payment Method  Payment_Cash  Payment_Credit Card  Payment_Digital Wallet
0  Digital Wallet         False                False                    True
1            Cash          True                False                   False
2  Digital Wallet         False                False                    True
3            Cash          True                False                   False
4     Credit Card         False                 True                   False
5            Cash          True                False                   False
6     Credit Card         False                 True                   False
7            Cash          True                False                   False
8  Digital Wallet         False                False                    True
9            Cash          True                False                   False
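
The combined frame above keeps the original `Payment Method` text column next to its encoded columns, which is convenient for inspection. If a later step needs a frame without the text column, one option is to let `pd.get_dummies` encode and replace the column in a single call through its `columns` parameter. This is a sketch on made-up rows; the `Quantity` column is a hypothetical stand-in for the other columns in the dataset.

```python
import pandas as pd

# Hypothetical toy frame; "Quantity" stands in for the remaining dataset columns
toy = pd.DataFrame({
    "Payment Method": ["Cash", "Credit Card", "Digital Wallet"],
    "Quantity": [2, 1, 5],
})

# Encode "Payment Method" in place: the text column is replaced by its indicator columns
model_ready = pd.get_dummies(toy, columns=["Payment Method"], prefix="Payment")

print(model_ready.columns.tolist())
# ['Quantity', 'Payment_Cash', 'Payment_Credit Card', 'Payment_Digital Wallet']
```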


## Step 5: Save Encoded Dataset

In [6]:
import os

# Create output directory if it doesn't exist
output_dir = '../../output_data/3_payment_method'
os.makedirs(output_dir, exist_ok=True)

# Save the encoded dataset
output_path = os.path.join(output_dir, 'encoded_payment_method_dataset.csv')
data_encoded.to_csv(output_path, index=False)
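
Because the encoded columns are booleans, the CSV stores them as the text `True` and `False`. A quick round-trip check can confirm that they come back as booleans when the file is reloaded. The sketch below writes made-up rows to a temporary directory instead of the real output path, so it does not touch the project files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame built the same way as data_encoded
toy = pd.DataFrame({"Payment Method": ["Cash", "Credit Card", "Digital Wallet"]})
toy_encoded = pd.concat(
    [toy, pd.get_dummies(toy["Payment Method"], prefix="Payment")],
    axis=1,
)

# Write to a temporary location and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "encoded_payment_method_dataset.csv")
    toy_encoded.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The indicator columns should still be boolean after the round trip
print(reloaded.dtypes)
```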

## Summary

**Encoding Applied:** One-Hot Encoding for Payment Method

**Results:**
- Created 3 binary columns: `Payment_Cash`, `Payment_Credit Card`, `Payment_Digital Wallet`
- Each row has exactly one active column (`True`, the payment method used) and two inactive columns (`False`)
- No missing values introduced
- No false ordinal relationships created
- Distribution preserved from original data

**Why One-Hot Encoding was chosen:**
1. Only 3 categories → manageable dimensionality (vs. 250 for Item)
2. No ordinal relationship between payment methods
3. Standard best practice for low-cardinality nominal variables
4. Each payment method becomes an independent binary feature
5. Compatible with most ML algorithms that expect numeric input

**Output File:** `Handle encoding data/output_data/3_payment_method/encoded_payment_method_dataset.csv`