In [12]:
import pandas as pd
import numpy as np


In [13]:
data = pd.read_csv('../../output/1_handle_missing_data/item_imputed.csv')

### Quantify missing Discount Applied

Determine the scale of missing entries in `Discount Applied` after Step 3. This is a binary categorical field (TRUE/FALSE) with substantial missingness (about 33%).


In [15]:
# Count missing Discount Applied values
# data['Discount Applied'].isna() creates a boolean Series where True indicates missing values
# .sum() counts the number of True values (i.e., missing entries)
missing_discount = data['Discount Applied'].isna()
# len(data) returns the total number of rows in the dataset
# .mean() on boolean Series gives the proportion of True values
print(f'Missing Discount Applied rows: {missing_discount.sum()} of {len(data)} ({missing_discount.mean():.2%})')

# Show distribution of non-missing values
print(f'\nDiscount Applied value distribution:')
# .value_counts() counts frequency of each unique value
# dropna=False includes NaN in the count
print(data['Discount Applied'].value_counts(dropna=False))

# Verify that critical columns are complete (from previous steps)
print(f'\nVerification of previous steps:')
print(f'Item missing: {data["Item"].isna().sum()}')
print(f'Price Per Unit missing: {data["Price Per Unit"].isna().sum()}')
print(f'Quantity missing: {data["Quantity"].isna().sum()}')
print(f'Total Spent missing: {data["Total Spent"].isna().sum()}')


Missing Discount Applied rows: 3988 of 11971 (33.31%)

Discount Applied value distribution:
Discount Applied
True     4019
NaN      3988
False    3964
Name: count, dtype: int64

Verification of previous steps:
Item missing: 0
Price Per Unit missing: 0
Quantity missing: 0
Total Spent missing: 0


### Missingness mechanism

Quantify how often `Discount Applied` is missing across Category, Payment Method, and Location to judge whether the pattern looks random (MCAR) or systematic (MAR/MNAR).


In [17]:
# Analyze missingness patterns across categories
summary = data.assign(missing_discount=missing_discount).groupby('Category')['missing_discount'].mean().sort_values(ascending=False)
print('Share of Discount Applied missing by Category:')
print(summary)

# Analyze missingness patterns across payment methods
payment_share = data.assign(missing_discount=missing_discount).groupby('Payment Method')['missing_discount'].mean().sort_values(ascending=False)
print('\nShare of Discount Applied missing by Payment Method:')
print(payment_share)

# Analyze missingness patterns across locations
location_share = data.assign(missing_discount=missing_discount).groupby('Location')['missing_discount'].mean().sort_values(ascending=False)
print('\nShare of Discount Applied missing by Location:')
print(location_share)

# Check variation to assess randomness
print('Variation Analysis (for MCAR assessment):')
# Calculate coefficient of variation for missingness rates
# Lower variation suggests more uniform distribution (MCAR)
# Higher variation suggests systematic pattern (MAR/MNAR)
category_std = summary.std()
category_mean = summary.mean()
cv_category = (category_std / category_mean) * 100 if category_mean > 0 else 0

print(f'Category missingness - Mean: {category_mean:.4f}, Std: {category_std:.4f}, CV: {cv_category:.2f}%')
print(f'Payment missingness - Range: {payment_share.min():.4f} to {payment_share.max():.4f}')
print(f'Location missingness - Range: {location_share.min():.4f} to {location_share.max():.4f}')

# Heuristic only: the 10% CV cutoff is a rule of thumb, not a statistical test
if cv_category < 10:
    print('\nMCAR (Missing Completely At Random)')
else:
    print('\nMAR (Missing At Random)')


Share of Discount Applied missing by Category:
Category
Furniture                             0.355410
Beverages                             0.352941
Electric household essentials         0.340369
Patisserie                            0.337266
Food                                  0.331785
Butchers                              0.325535
Milk Products                         0.314607
Computers and electric accessories    0.306703
Name: missing_discount, dtype: float64

Share of Discount Applied missing by Payment Method:
Payment Method
Cash              0.340726
Credit Card       0.330023
Digital Wallet    0.328343
Name: missing_discount, dtype: float64

Share of Discount Applied missing by Location:
Location
In-store    0.335423
Online      0.330916
Name: missing_discount, dtype: float64
Variation Analysis (for MCAR assessment):
Category missingness - Mean: 0.3331, Std: 0.0172, CV: 5.15%
Payment missingness - Range: 0.3283 to 0.3407
Location missingness - Range: 0.3309 to 0.3354

MCAR (Missing Completely At Random)
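
The CV cutoff above is a rule of thumb. As a complementary check, a permutation test can estimate how often chance alone would produce a spread in group missingness rates as large as the observed one. The sketch below is illustrative only: it runs on synthetic data, and the names `demo`, `rate_spread`, and `n_permutations` are assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (illustrative only): missingness is
# drawn independently of Category, which is what MCAR would look like.
demo = pd.DataFrame({
    'Category': rng.choice(['Food', 'Beverages', 'Furniture'], size=3000),
    'missing_discount': rng.random(3000) < 0.33,
})

# Integer code per category, plus the missingness flags as floats
codes, _ = pd.factorize(demo['Category'])
flags = demo['missing_discount'].to_numpy(dtype=float)
group_sizes = np.bincount(codes)

def rate_spread(flag_values):
    """Max minus min missingness rate across categories."""
    rates = np.bincount(codes, weights=flag_values) / group_sizes
    return rates.max() - rates.min()

observed_spread = rate_spread(flags)

# Shuffle the flags to break any link with Category, then count how often
# a shuffled spread is at least as large as the observed one.
n_permutations = 2000
permuted = np.array(
    [rate_spread(rng.permutation(flags)) for _ in range(n_permutations)]
)
p_value = (np.sum(permuted >= observed_spread) + 1) / (n_permutations + 1)

print(f'Observed spread: {observed_spread:.4f}, permutation p-value: {p_value:.3f}')
```

A large p-value would be consistent with missingness being unrelated to Category. It would not prove MCAR, since dependence on the unobserved values cannot be tested this way.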

### Observed value distribution analysis

Examine the distribution of observed (non-missing) values to check the balance between TRUE and FALSE.


In [18]:
# Analyze distribution of observed values
# Filter to non-missing values only
observed_values = data[data['Discount Applied'].notna()]['Discount Applied']

print('Distribution of observed (non-missing) Discount Applied values:')
print('=' * 80)
# .value_counts() counts frequency of each unique value
# .sort_index() sorts by the value itself (False, True) for consistent display
value_counts = observed_values.value_counts().sort_index()
print(value_counts)


Distribution of observed (non-missing) Discount Applied values:
Discount Applied
False    3964
True     4019
Name: count, dtype: int64


### Missing data classification

**Classification: MCAR (Missing Completely At Random)**

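**Rationale:**
- Missingness is close to 33% in every group: 30.7% to 35.5% across categories, 32.8% to 34.1% across payment methods, and 33.1% to 33.5% across locations
- The coefficient of variation of the category-level rates is low (5.15%), below the 10% heuristic cutoff used above
- No relationship between missingness and Category, Payment Method, or Location is apparent
- Observed TRUE/FALSE values are nearly balanced (4,019 vs 3,964, approximately 50/50), so neither value dominates the observed data
- These checks are consistent with MCAR, but observed data alone cannot rule out dependence on the unobserved values themselves (MNAR)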



### Handling strategy: Create Unknown category

**Justification for Unknown category (not deletion or imputation):**

1. **MCAR pattern:** Because missingness appears completely random, the rows with observed values remain representative, so flagging the gaps instead of imputing them introduces no systematic bias
2. **Too much data to drop:** Deleting 33% of rows would lose 3,988 valuable transactions
3. **Preserves transparency:** Unknown category explicitly indicates missing information
4. **No false assumptions:** Avoids incorrectly imputing TRUE or FALSE when we do not know
5. **Maintains all other complete data:** All critical columns (Item, Price, Quantity, Total) are 100% complete

**Why Unknown (not other methods):**
- **Deletion:** Would lose 33% of the dataset (3,988 rows), which is too much valuable data
- **Random imputation:** Adds noise without adding information
- **Mode imputation:** Creates false certainty, since we genuinely do not know the values, and would skew the nearly balanced TRUE/FALSE split
- **Predictive model:** Adds complexity with little expected benefit; no predictive signal was identified
- **Unknown category:** Most transparent and preserves all transaction data
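
One side effect of the Unknown approach is that a single column ends up holding both booleans and a string. A possible variation, sketched below on a small made-up frame, converts everything to string labels and stores the result as a pandas `Categorical` with three explicit levels. The frame `demo` and the label order are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the real column comes from item_imputed.csv
demo = pd.DataFrame({'Discount Applied': [True, False, np.nan, True, np.nan]})

# Fill the gaps first, then convert every value to its string label
# so the column holds a single type ('True', 'False', 'Unknown').
labels = demo['Discount Applied'].fillna('Unknown').astype(str)

# Store as a categorical with the three levels declared explicitly
demo['Discount Applied'] = pd.Categorical(
    labels, categories=['False', 'True', 'Unknown']
)

print(demo['Discount Applied'].value_counts())
```

With explicit categories, any unexpected label would surface as a missing value instead of silently becoming a fourth level.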


In [19]:
# Display sample of rows with missing Discount Applied
print('Sample of rows with missing Discount Applied (to be handled):')

print(data[missing_discount][['Transaction ID', 'Category', 'Item', 'Total Spent', 'Discount Applied']].head(10))


Sample of rows with missing Discount Applied (to be handled):
   Transaction ID                            Category          Item  \
4     TXN_1004124  Computers and electric accessories    Item_7_CEA   
5     TXN_1004284                       Milk Products  Item_25_MILK   
7     TXN_1006123       Electric household essentials    Item_8_EHE   
8     TXN_1006129                       Milk Products  Item_17_MILK   
10    TXN_1007144                           Beverages    Item_2_BEV   
16    TXN_1010976  Computers and electric accessories   Item_12_CEA   
17    TXN_1011669                                Food  Item_13_FOOD   
18    TXN_1011882                          Patisserie   Item_21_PAT   
26    TXN_1015414                            Butchers   Item_23_BUT   
27    TXN_1016209                                Food  Item_25_FOOD   

    Total Spent Discount Applied  
4          70.0              NaN  
5         123.0              NaN  
7          15.5              NaN  
8         232.0 

In [24]:
# Count missing values before handling
# missing_discount.sum() gives the total number of missing Discount Applied values
discount_missing_before = missing_discount.sum()
print(f'Discount Applied missing before handling: {discount_missing_before}')

# Show distribution before handling
print('\nDistribution BEFORE handling:')
print(data['Discount Applied'].value_counts(dropna=False))

# Fill missing values with the string "Unknown"
# .fillna('Unknown') replaces all NaN values with the string "Unknown"
# This creates a third category alongside True and False
# Note: the column now mixes booleans with a string, so it has object dtype
data['Discount Applied'] = data['Discount Applied'].fillna('Unknown')

# Count missing values after handling
# data['Discount Applied'].isna().sum() recounts missing values after filling
discount_missing_after = data['Discount Applied'].isna().sum()
# Calculate how many values were handled
values_handled = discount_missing_before - discount_missing_after

print(f'\nDiscount Applied missing after handling: {discount_missing_after}')
print(f'Values handled (converted to "Unknown"): {values_handled}')
print(f'Handling success rate: {values_handled / discount_missing_before:.1%}')

# Show distribution after handling
print('\nDistribution AFTER handling:')
print(data['Discount Applied'].value_counts())


Discount Applied missing before handling: 3988

Distribution BEFORE handling:
Discount Applied
True     4019
NaN      3988
False    3964
Name: count, dtype: int64

Discount Applied missing after handling: 0
Values handled (converted to "Unknown"): 3988
Handling success rate: 100.0%

Distribution AFTER handling:
Discount Applied
True       4019
Unknown    3988
False      3964
Name: count, dtype: int64


### Validation: Verify all missing values handled

Verify that no missing values remain in any column, so the dataset is now 100% complete.


In [25]:
# Comprehensive missing value check across ALL columns
print('Final missing value check across ALL columns:')

# .isnull().sum() counts missing values for each column
missing_summary = data.isnull().sum()
# Filter to show only columns with missing values
missing_cols = missing_summary[missing_summary > 0]
print(f'\nMissing columns: {missing_cols}')


Final missing value check across ALL columns:

Missing columns: Series([], dtype: int64)
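
The printed check above relies on reading the output by eye. A minimal sketch of the same validation as assertions follows, so a rerun would fail loudly if something regressed. It uses a tiny made-up frame, and `demo`, `missing_mask`, and `rows_before` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame standing in for the real dataset
demo = pd.DataFrame({
    'Item': ['Item_1', 'Item_2', 'Item_3'],
    'Discount Applied': [True, np.nan, False],
})

# Remember which rows were missing and how many rows existed before handling
missing_mask = demo['Discount Applied'].isna()
rows_before = len(demo)

demo['Discount Applied'] = demo['Discount Applied'].fillna('Unknown')

# No missing values remain in any column
assert demo.isna().sum().sum() == 0
# Exactly the previously missing rows are labeled 'Unknown'
assert (demo['Discount Applied'] == 'Unknown').equals(missing_mask)
# No rows were dropped along the way
assert len(demo) == rows_before
```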


### Sample inspection: After handling

Display sample rows to verify that the Unknown category was applied correctly.


In [26]:
# Display sample of rows that were handled
print('Sample of rows after Discount Applied handling:')
# missing_discount is still the original boolean filter (before handling)
# Use it to show the same rows, now with "Unknown" values
sample_handled = data[missing_discount][['Transaction ID', 'Category', 'Item', 'Total Spent', 'Discount Applied']].head(10)
print(sample_handled)


Sample of rows after Discount Applied handling:
   Transaction ID                            Category          Item  \
4     TXN_1004124  Computers and electric accessories    Item_7_CEA   
5     TXN_1004284                       Milk Products  Item_25_MILK   
7     TXN_1006123       Electric household essentials    Item_8_EHE   
8     TXN_1006129                       Milk Products  Item_17_MILK   
10    TXN_1007144                           Beverages    Item_2_BEV   
16    TXN_1010976  Computers and electric accessories   Item_12_CEA   
17    TXN_1011669                                Food  Item_13_FOOD   
18    TXN_1011882                          Patisserie   Item_21_PAT   
26    TXN_1015414                            Butchers   Item_23_BUT   
27    TXN_1016209                                Food  Item_25_FOOD   

    Total Spent Discount Applied  
4          70.0          Unknown  
5         123.0          Unknown  
7          15.5          Unknown  
8         232.0          Unkno

### Persist final cleaned dataset

Save the fully cleaned dataset. This is the final output of the entire pipeline.


In [27]:
# Save the final cleaned dataset to CSV
# to_csv writes the DataFrame to a CSV file
# index=False prevents writing row numbers as a column
# This is the FINAL output of the entire 4-step missing data handling pipeline
output_path = '../../output/1_handle_missing_data/final_cleaned_dataset.csv'
data.to_csv(output_path, index=False)
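
Because the column mixes booleans with the string "Unknown", it is worth checking how it survives a CSV round trip. The sketch below uses an in-memory buffer instead of the real output path, and `demo`, `buffer`, and `reloaded` are illustrative names. The expectation is that all three values come back as strings, so downstream filters would need to compare against `'True'` and not `True`.

```python
import io
import pandas as pd

# Mixed column mimicking the handled 'Discount Applied' field
demo = pd.DataFrame({'Discount Applied': [True, False, 'Unknown']})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Inspect the values as they appear after reloading
print(reloaded['Discount Applied'].tolist())

# Filter on the string label, since the booleans no longer exist as bools
discounted = reloaded['Discount Applied'] == 'True'
print(f'Rows with a discount: {discounted.sum()}')
```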

### Summary

**Discount Applied Handling**

**Classification:** MCAR (Missing Completely At Random)
- Missingness is evenly distributed (~33%) across all categories, payment methods, and locations
- The distribution of TRUE/FALSE in observed data is balanced (approximately 50/50)
- The missing pattern shows no relationship with other variables
- Plausibly a data collection issue where the field was not recorded at random (an assumption; the data alone cannot confirm the cause)

**Method:** Create Unknown category
- Converted 3,988 missing values to "Unknown" string
- Preserves all transaction data (no rows dropped)
- Transparent handling - explicitly indicates unknown status

**Justification:**
- MCAR pattern means the observed rows remain representative, so no imputation is needed to correct bias
- Too much data to drop (33% of dataset)
- Unknown category is most transparent and honest
- Avoids false certainty from imputation
- Maintains all other complete columns


**Final Pipeline Results:**
- Original rows: 12,575
- Final rows: 11,971 (95.2% retention)

