# Location - Binary Encoding

This notebook implements **binary encoding** for the `Location` categorical feature.

## Why Binary Encoding for Location?

**Location has only 2 unique values:** `In-store` and `Online`

### Method Comparison:

| Alternative Method | Why NOT Suitable |
|-------------------|------------------|
| **One-Hot Encoding** | Overkill for a binary variable; creates 2 columns (each the complement of the other) when 1 is sufficient |
| **Target Encoding** | Unnecessarily complex for a two-category feature |
| **Frequency Encoding** | Loses the actual categorical meaning (In-store vs Online) |
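
To make the one-hot row of the table concrete, here is a minimal sketch on a toy column (the `sample` frame is made up for illustration and is not the real dataset). It shows that one-hot encoding produces two columns that carry the same information as a single 0/1 column.

```python
import pandas as pd

# Toy stand-in for the Location column (illustrative data only)
sample = pd.DataFrame({"Location": ["In-store", "Online", "Online", "In-store"]})

# One-hot encoding yields one indicator column per category: 2 columns here
one_hot = pd.get_dummies(sample["Location"], dtype=int)
print(one_hot.columns.tolist())  # ['In-store', 'Online']

# A single 0/1 column carries the same information
binary = (sample["Location"] == "Online").astype(int)

# The 'Online' indicator equals the binary column,
# and the 'In-store' indicator is simply its complement
print((one_hot["Online"] == binary).all())
print((one_hot["In-store"] == 1 - binary).all())
```

Both comparisons should hold for any column with exactly these two labels, which is why the second one-hot column adds nothing for a model.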

### Why Binary Encoding Works:
- Binary variables naturally map to 0/1
- Preserves the "online vs. offline" distinction
- Single column (efficient)
- Interpretable: **1 = Online, 0 = In-store**

**Expected Output:**
- New column: `Location_Encoded` (binary: 0 or 1)
- Counts of 0 and 1 match the counts of `In-store` and `Online` in the original `Location` column

In [11]:
import pandas as pd
from pathlib import Path


In [12]:

# Input and output paths
CSV_IN = "../../../handle_missing_data/output_data/4_discount_applied/final_cleaned_dataset.csv"
CSV_OUT = "../../output_data/2_location/location_binary_encoded.csv"

LOCATION = "Location"

# Load the cleaned dataset
df = pd.read_csv(CSV_IN)
data = df.copy()


## Binary Encoding Implementation

**Encoding rule:** `In-store = 0`, `Online = 1`

In [13]:
# Apply binary encoding: In-store = 0, Online = 1
# Note: any value other than 'Online' (including NaN) also becomes 0
data['Location_Encoded'] = (data[LOCATION] == 'Online').astype(int)


print("\nOriginal Location vs Encoded:")
print(data[[LOCATION, 'Location_Encoded']].head(10))

print("\nEncoded value distribution:")
print(data['Location_Encoded'].value_counts().sort_index())


Original Location vs Encoded:
   Location  Location_Encoded
0  In-store                 0
1    Online                 1
2    Online                 1
3  In-store                 0
4  In-store                 0
5    Online                 1
6    Online                 1
7    Online                 1
8    Online                 1
9    Online                 1

Encoded value distribution:
Location_Encoded
0    5903
1    6068
Name: count, dtype: int64
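
The comparison-based encoding used in this notebook sends every value that is not exactly `Online` to 0, which would also include typos or missing values. As an alternative sketch (not the method used here), an explicit dictionary mapping can raise an error on unexpected labels. The names `LOCATION_MAP` and `encode_location` are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

# Hypothetical mapping; the two labels are the ones used in this notebook
LOCATION_MAP = {"In-store": 0, "Online": 1}

def encode_location(series: pd.Series) -> pd.Series:
    """Map Location labels to 0/1, raising on anything unexpected."""
    encoded = series.map(LOCATION_MAP)  # unmapped or missing values become NaN
    unexpected = series[encoded.isna()].unique()
    if len(unexpected) > 0:
        raise ValueError(f"Unexpected Location values: {list(unexpected)}")
    return encoded.astype(int)

sample = pd.Series(["In-store", "Online", "Online"], name="Location")
print(encode_location(sample).tolist())  # [0, 1, 1]
```

On a cleaned dataset with only the two expected labels, this should give the same result as the comparison above; the difference only shows up when the input contains something else.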


## Validation

Verify binary encoding correctness:

In [14]:
# 1. Check that only 0/1 values are present
valid_values = data['Location_Encoded'].isin([0, 1]).all()
print(f"1. Only contains 0/1 values: {valid_values}")

# 2. Check no missing values
no_missing = data['Location_Encoded'].isna().sum() == 0
print(f"2. No missing values: {no_missing}")

# 3. Verify mapping correctness (masks are built from `data`, the frame being indexed)
in_store_check = (data.loc[data[LOCATION] == 'In-store', 'Location_Encoded'] == 0).all()
online_check = (data.loc[data[LOCATION] == 'Online', 'Location_Encoded'] == 1).all()
print(f"3. In-store → 0 mapping correct: {in_store_check}")
print(f"4. Online → 1 mapping correct: {online_check}")


1. Only contains 0/1 values: True
2. No missing values: True
3. In-store → 0 mapping correct: True
4. Online → 1 mapping correct: True
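
As a complementary view of the mapping checks, a cross-tabulation of the original labels against the encoded values puts all counts in one table. The sketch below uses a small made-up frame (`sample`) rather than the real data; for a correct encoding, the off-diagonal cells should be zero.

```python
import pandas as pd

# Toy data for illustration only
sample = pd.DataFrame(
    {"Location": ["In-store", "Online", "Online", "In-store", "Online"]}
)
sample["Location_Encoded"] = (sample["Location"] == "Online").astype(int)

# Rows: original labels, columns: encoded values
table = pd.crosstab(sample["Location"], sample["Location_Encoded"])
print(table)

# Off-diagonal cells count mismatches between label and code
mismatches = table.loc["In-store", 1] + table.loc["Online", 0]
print(f"Mismatched rows: {mismatches}")
```

The diagonal cells also give the per-category counts, so the same table can confirm that the encoded distribution matches the original one.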


## Save Output

Save the dataset with the encoded Location column:

In [15]:
# Create output directory if it doesn't exist
Path(CSV_OUT).parent.mkdir(parents=True, exist_ok=True)


data.to_csv(CSV_OUT, index=False)
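
After saving, it can be useful to read the file back and confirm that the encoded column survived the round trip. The following is a minimal sketch that writes a toy frame to a temporary directory instead of the real `CSV_OUT` path; the frame contents and the temporary location are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the encoded dataset
sample = pd.DataFrame(
    {"Location": ["In-store", "Online"], "Location_Encoded": [0, 1]}
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "2_location" / "location_binary_encoded.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(out_path, index=False)

    # Reload and compare against what was written
    reloaded = pd.read_csv(out_path)

columns_ok = reloaded.columns.tolist() == sample.columns.tolist()
values_ok = reloaded["Location_Encoded"].tolist() == sample["Location_Encoded"].tolist()
round_trip_ok = columns_ok and values_ok
print(f"Round trip preserved encoded column: {round_trip_ok}")
```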