In [40]:
import pandas as pd
import numpy as np


In [41]:
# Input file after dropping rows with missing Total Spent
CSV_IN = "../../output_data/1_total_spent/total_spent_cleaned.csv"
CSV_OUT = "../../output_data/2_price_per_unit/price_per_unit_reconstructed.csv"
# Column names used throughout this notebook
TOTAL_SPENT = "Total Spent"
PRICE_PER_UNIT = "Price Per Unit"
QUANTITY = "Quantity"
CATEGORY = "Category"
PAYMENT_METHOD = "Payment Method"
LOCATION = "Location"
ITEM = "Item"
TRANSACTION_ID = "Transaction ID"

# Error-handling mode passed to pd.to_numeric
COERCE_ERRORS = "coerce"

In [42]:
df = pd.read_csv(CSV_IN)

### Quantify missing Price Per Unit

Determine the scale of missing entries in `Price Per Unit` after STEP 1 deletion. Assess whether deterministic reconstruction using the formula `Price = Total Spent ÷ Quantity` is feasible.
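
As a small illustration of the coercion and counting idiom used in the next cell, the sketch below applies `pd.to_numeric(..., errors="coerce")` to a made-up column. The toy values and the names `raw`, `price`, and `missing` are assumptions for illustration only; they are not rows from the dataset.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, one blank, one non-numeric token
raw = pd.Series(["15.5", "", "n/a", "7"], name="Price Per Unit")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
price = pd.to_numeric(raw, errors="coerce")

# isna() yields a boolean mask; sum() counts True values, mean() gives their share
missing = price.isna()
print(f"Missing: {missing.sum()} of {len(price)} ({missing.mean():.2%})")
```

Coercing first means blanks and stray tokens are counted as missing alongside genuine NaN values, which is why the conversion runs before the count.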


In [43]:
# Convert column's values to numeric, coercing errors to NaN
# col is each of the relevant columns
for col in [PRICE_PER_UNIT, QUANTITY, TOTAL_SPENT]:
    # pd.to_numeric converts values in col to numeric, with errors coerced to NaN
    # df[col] accesses the target column
    # errors=COERCE_ERRORS specifies that parsing errors are set to NaN
    df[col] = pd.to_numeric(df[col], errors = COERCE_ERRORS)

# Flag missing Price Per Unit values
# df[PRICE_PER_UNIT].isna() creates a boolean Series where True indicates a missing value
missingPrice = df[PRICE_PER_UNIT].isna()
# .sum() counts the True values (missing entries); len(df) is the total number of rows
# .mean() on a boolean Series gives the proportion of True values
print(f'Missing Price Per Unit rows: {missingPrice.sum()} of {len(df)} ({missingPrice.mean():.2%})')

# Verify that Total Spent and Quantity are complete (from STEP 1)
print(f'\nTotal Spent missing: {df[TOTAL_SPENT].isna().sum()} (should be 0)')
print(f'Quantity missing: {df[QUANTITY].isna().sum()} (should be 0)')


Missing Price Per Unit rows: 609 of 11971 (5.09%)

Total Spent missing: 0 (should be 0)
Quantity missing: 0 (should be 0)


### Missingness mechanism

Quantify how often `Price Per Unit` is missing within each category, payment method, and location to judge whether the pattern is random or systematic. This helps confirm the MAR classification from the overall analysis.


In [44]:
# Analyze missingness patterns across categories
# .assign creates a new column 'missingPrice' holding the boolean missing indicator
# .groupby(CATEGORY) groups all rows by their category value
# ['missingPrice'].mean() calculates the proportion of missing values per category
# .sort_values(ascending=False) sorts categories by missing proportion (highest first)
# * 100 converts the proportions to percentages
summary = df.assign(missingPrice=missingPrice).groupby(CATEGORY)['missingPrice'].mean().sort_values(ascending=False) * 100
print('Share of Price Per Unit missing by Category %:')
print(summary.round(2))

# Analyze missingness patterns across payment methods
# Same logic as above, but grouped by 'Payment Method'
paymentShare = df.assign(missingPrice=missingPrice).groupby(PAYMENT_METHOD)['missingPrice'].mean().sort_values(ascending=False) * 100
print('\nShare of Price Per Unit missing by Payment Method %:')
print(paymentShare.round(2))

# Analyze missingness patterns across locations
# Same logic as above, but grouped by 'Location'
locationShare = df.assign(missingPrice=missingPrice).groupby(LOCATION)['missingPrice'].mean().sort_values(ascending=False) * 100
print('\nShare of Price Per Unit missing by Location %:')
print(locationShare.round(2))


Share of Price Per Unit missing by Category %:
Category
Milk Products                         5.82
Computers and electric accessories    5.42
Food                                  5.37
Electric household essentials         5.21
Butchers                              5.01
Patisserie                            5.00
Beverages                             4.61
Furniture                             4.26
Name: missingPrice, dtype: float64

Share of Price Per Unit missing by Payment Method %:
Payment Method
Digital Wallet    5.71
Credit Card       5.04
Cash              4.53
Name: missingPrice, dtype: float64

Share of Price Per Unit missing by Location %:
Location
Online      5.32
In-store    4.84
Name: missingPrice, dtype: float64
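
The per-group shares above differ by roughly a percentage point. One way to gauge whether such a spread exceeds sampling noise is a chi-square statistic on the missingness-by-group table. The sketch below is a hand-rolled version on toy data; the function name `chi_square_statistic` and the toy series are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def chi_square_statistic(is_missing: pd.Series, groups: pd.Series):
    """Return (statistic, degrees_of_freedom) for a group-by-missingness table."""
    observed = pd.crosstab(groups, is_missing).to_numpy(dtype=float)
    # Expected counts if missingness were independent of the grouping variable
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    statistic = float(((observed - expected) ** 2 / expected).sum())
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof

# Toy data (hypothetical): both groups have the same 25% missing rate
groups = pd.Series(["A"] * 4 + ["B"] * 4)
is_missing = pd.Series([True, False, False, False] * 2)
stat, dof = chi_square_statistic(is_missing, groups)
print(f"statistic={stat:.3f}, dof={dof}")
```

On the real data this could be called as `chi_square_statistic(missingPrice, df[CATEGORY])`. With eight categories the table has 7 degrees of freedom, for which the 5% critical value is about 14.07; a statistic well below that would suggest the category differences are compatible with chance.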


### Co-missingness analysis

Examine whether `Price Per Unit` is missing alongside other key fields, particularly `Item`. This helps confirm the systematic pattern identified in the overall analysis.


In [45]:
# Check co-missingness with Item
# missingPrice & df['Item'].isna() creates boolean Series that is True only when BOTH are missing
# .sum() counts how many rows have both fields missing
# Perfect overlap would mean all missing Price Per Unit also have missing Item
itemOverlap = (missingPrice & df[ITEM].isna()).sum()
print(f'Rows with both Price Per Unit and Item missing: {itemOverlap}')
print(f'Price Per Unit missing: {missingPrice.sum()}')
print(f'Perfect overlap: {itemOverlap == missingPrice.sum()}')
print(f'Overlap percentage: {itemOverlap / missingPrice.sum():.1%}')

# Check co-missingness with Total Spent (should be 0 after STEP 1)
# Count rows where both Price Per Unit and Total Spent are missing
totalOverlap = (missingPrice & df[TOTAL_SPENT].isna()).sum()
print(f'\nRows with both Price Per Unit and Total Spent missing: {totalOverlap} (should be 0)')

# Check co-missingness with Quantity (should be 0 after STEP 1)
# Count rows where both Price Per Unit and Quantity are missing
qtyOverlap = (missingPrice & df[QUANTITY].isna()).sum()
print(f'Rows with both Price Per Unit and Quantity missing: {qtyOverlap} (should be 0)')


Rows with both Price Per Unit and Item missing: 609
Price Per Unit missing: 609
Perfect overlap: True
Overlap percentage: 100.0%

Rows with both Price Per Unit and Total Spent missing: 0 (should be 0)
Rows with both Price Per Unit and Quantity missing: 0 (should be 0)
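
The overlap counts above can also be read off a two-way table of the two missingness indicators. The following is a minimal sketch on a toy frame; the item labels and the names `toy`, `both`, and `only_one` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: Item and Price Per Unit are missing on the same rows
toy = pd.DataFrame({
    "Item": ["Item_A", np.nan, "Item_B", np.nan],
    "Price Per Unit": [5.0, np.nan, 6.5, np.nan],
})

item_missing = toy["Item"].isna().rename("Item missing")
price_missing = toy["Price Per Unit"].isna().rename("Price missing")

# Rows: is Item missing? Columns: is Price Per Unit missing?
print(pd.crosstab(item_missing, price_missing))

# Perfect co-missingness: no row has exactly one of the two fields missing
both = int((item_missing & price_missing).sum())
only_one = int((item_missing ^ price_missing).sum())
print(f"both missing: {both}, exactly one missing: {only_one}")
```

Checking `only_one == 0` is slightly stricter than comparing counts, because it also rules out rows where `Item` is missing but `Price Per Unit` is present.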


### Reconstructability assessment

Since `Price Per Unit = Total Spent ÷ Quantity`, assess how many missing `Price Per Unit` values can be deterministically reconstructed. After STEP 1, both `Total Spent` and `Quantity` are complete, so every missing price can be reconstructed as long as `Quantity` is non-zero.
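
A minimal sketch of the reconstruction rule with the zero-quantity guard folded into the mask is shown below. The helper name `reconstruct_price` and the toy rows are assumptions for illustration; the notebook itself performs the check and the fill in separate cells.

```python
import numpy as np
import pandas as pd

def reconstruct_price(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Price Per Unit with Total Spent / Quantity where that is defined."""
    out = frame.copy()
    fillable = (
        out["Price Per Unit"].isna()
        & out["Total Spent"].notna()
        & out["Quantity"].notna()
        & (out["Quantity"] != 0)  # skip rows that would divide by zero
    )
    out.loc[fillable, "Price Per Unit"] = (
        out.loc[fillable, "Total Spent"] / out.loc[fillable, "Quantity"]
    )
    return out

# Hypothetical toy rows: one fillable, one with zero quantity, one already complete
toy = pd.DataFrame({
    "Price Per Unit": [np.nan, np.nan, 4.0],
    "Quantity": [10.0, 0.0, 2.0],
    "Total Spent": [155.0, 0.0, 8.0],
})
result = reconstruct_price(toy)
print(result)
```

Rows with zero quantity are left as NaN rather than becoming `inf` or `NaN` through division, so they stay visible to any later missing-value check.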


In [46]:
# Check if Price Per Unit can be reconstructed from Total Spent and Quantity
# For reconstruction, we need Price Per Unit to be missing BUT both Total and Quantity to be present
# missingPrice ensures Price Per Unit is missing
# df[TOTAL_SPENT].notna() ensures Total Spent is NOT missing
# df[QUANTITY].notna() ensures Quantity is NOT missing
# & combines all three conditions (all must be True)
reconstructable = missingPrice & df[TOTAL_SPENT].notna() & df[QUANTITY].notna()
reconstructableCount = reconstructable.sum()

print(f'Missing Price Per Unit that CAN be reconstructed: {reconstructableCount} out of {missingPrice.sum()}')
print(f'Reconstruction rate: {reconstructableCount / missingPrice.sum():.1%}')

# Check for any cases where Quantity is zero (division by zero issue)
# df[QUANTITY] == 0 creates boolean Series where True indicates zero quantity
# This would cause division by zero when calculating Price = Total / Quantity
zeroQty = missingPrice & (df[QUANTITY] == 0)
zeroQtyCount = zeroQty.sum()

if zeroQtyCount > 0:
    print(f'\n⚠ Warning: {zeroQtyCount} rows have missing Price with Quantity = 0')
    print('  These cannot be reconstructed due to division by zero')
else:
    print(f'\n✓ No division by zero issues: All {reconstructableCount} missing prices can be safely reconstructed')


Missing Price Per Unit that CAN be reconstructed: 609 out of 609
Reconstruction rate: 100.0%

✓ No division by zero issues: All 609 missing prices can be safely reconstructed


### Missing data classification

**Classification: MAR (Missing At Random)**

**Rationale:**
- **Perfect overlap with Item field:** All 609 missing Price Per Unit values occur when Item is also missing (100% co-missingness)
- Missing rates vary by category (4.26%-5.82%), with Milk Products showing the highest rate
- The missingness depends on the Item field (an observable variable)
- Not MCAR because missing rates are not uniform across categories
- Not MNAR because the missingness is explained by the Item field status, not by the price values themselves

**Key finding:** When `Item` was not recorded during data collection, `Price Per Unit` was also systematically omitted. However, since both `Total Spent` and `Quantity` are present, we can deterministically reconstruct all missing prices with zero estimation error.


### Handling strategy: Deterministic imputation

**Justification for formula-based reconstruction (not statistical imputation):**

1. **Mathematical relationship exists:** `Price Per Unit = Total Spent ÷ Quantity` is a known, exact formula
2. **Zero estimation error:** The value is calculated from recorded fields rather than estimated, so it is exact whenever `Total Spent` and `Quantity` are themselves correct
3. **100% reconstructable:** All 609 missing prices can be calculated exactly (both Total and Quantity present)
4. **Maintains data integrity:** The reconstructed values perfectly satisfy the mathematical relationship
5. **Best practice:** When deterministic relationships exist, always use them before statistical methods

**Why not statistical imputation:**
- No need for mean/median/mode imputation when exact values can be calculated
- Statistical methods introduce estimation error; formula-based reconstruction has zero error
- Maintains perfect mathematical consistency across the dataset
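
To make the contrast concrete, the sketch below hides two known prices in a toy table and compares a per-category median fill against the formula. All values and names here (`truth`, `observed`, `median_fill`, `formula_fill`) are hypothetical and chosen only to illustrate the argument.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with known true prices
truth = pd.DataFrame({
    "Category": ["Food", "Food", "Food", "Beverages", "Beverages", "Beverages"],
    "Price Per Unit": [5.0, 9.5, 23.0, 6.5, 9.5, 12.5],
    "Quantity": [2.0, 4.0, 1.0, 3.0, 2.0, 5.0],
})
truth["Total Spent"] = truth["Price Per Unit"] * truth["Quantity"]

# Hide two prices to simulate missing values
observed = truth.copy()
hidden = observed.index.isin([2, 3])
observed.loc[hidden, "Price Per Unit"] = np.nan

# Strategy A: per-category median of the observed prices
median_fill = observed["Price Per Unit"].fillna(
    observed.groupby("Category")["Price Per Unit"].transform("median")
)
# Strategy B: formula-based reconstruction
formula_fill = observed["Price Per Unit"].fillna(
    observed["Total Spent"] / observed["Quantity"]
)

median_error = float((median_fill - truth["Price Per Unit"]).abs()[hidden].sum())
formula_error = float((formula_fill - truth["Price Per Unit"]).abs()[hidden].sum())
print(f"median error: {median_error}, formula error: {formula_error}")
```

In this toy case the median fill misses both hidden prices while the formula recovers them exactly, which is the behaviour the justification above relies on.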


In [47]:
# Display sample of rows to be imputed
print('Sample of rows with missing Price Per Unit (to be reconstructed):')
print('=' * 80)
# df[missingPrice] filters to show only rows where Price Per Unit is missing
# .head(10) shows the first 10 such rows
# This allows visual inspection of the data before reconstruction
print(df[missingPrice][[TRANSACTION_ID, CATEGORY, ITEM, PRICE_PER_UNIT, QUANTITY, TOTAL_SPENT]].head(10))

print('\nObservations about rows to be reconstructed:')
print('- All have missing Item (100% co-missingness)')
print('- All have complete Total Spent and Quantity')
print('- Price Per Unit can be calculated: Total Spent ÷ Quantity')
print('- Zero estimation error (deterministic reconstruction)')


Sample of rows with missing Price Per Unit (to be reconstructed):
    Transaction ID                            Category Item  Price Per Unit  \
12     TXN_1007496                            Butchers  NaN             NaN   
50     TXN_1032287                                Food  NaN             NaN   
68     TXN_1044590       Electric household essentials  NaN             NaN   
70     TXN_1046262                       Milk Products  NaN             NaN   
71     TXN_1046367  Computers and electric accessories  NaN             NaN   
76     TXN_1051223                          Patisserie  NaN             NaN   
87     TXN_1058643                                Food  NaN             NaN   
104    TXN_1071762                           Beverages  NaN             NaN   
134    TXN_1095879                           Beverages  NaN             NaN   
136    TXN_1096977                                Food  NaN             NaN   

     Quantity  Total Spent  
12       10.0        155.0  
50    

In [48]:
# Count missing values before imputation
# missingPrice.sum() gives the total number of missing Price Per Unit values
priceMissingBefore = missingPrice.sum()
print(f'Price Per Unit missing before reconstruction: {priceMissingBefore}')

# Perform deterministic imputation using the mathematical formula
# Create a filter for rows where Price Per Unit is missing
# .loc[filter, column] allows us to update specific rows and columns
# df.loc[missingPrice, TOTAL_SPENT] / df.loc[missingPrice, QUANTITY] performs element-wise division
# This calculates: Price = Total ÷ Quantity for each missing row
df.loc[missingPrice, PRICE_PER_UNIT] = df.loc[missingPrice, TOTAL_SPENT] / df.loc[missingPrice, QUANTITY]

# Count missing values after imputation
# df[PRICE_PER_UNIT].isna().sum() recounts missing values after reconstruction
priceMissingAfter = df[PRICE_PER_UNIT].isna().sum()
# Calculate how many values were successfully reconstructed
valuesReconstructed = priceMissingBefore - priceMissingAfter

print(f'\nPrice Per Unit missing after reconstruction: {priceMissingAfter}')
print(f'Values successfully reconstructed: {valuesReconstructed}')
print(f'Reconstruction success rate: {valuesReconstructed / priceMissingBefore:.1%}')


Price Per Unit missing before reconstruction: 609

Price Per Unit missing after reconstruction: 0
Values successfully reconstructed: 609
Reconstruction success rate: 100.0%


### Validation: Verify reconstruction accuracy

Verify that the reconstructed Price Per Unit values are correct by checking mathematical consistency across ALL rows (both originally complete and newly reconstructed).
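
The consistency check can also be expressed with `np.isclose`, which combines a relative and an absolute tolerance instead of a single fixed threshold. This is a sketch on toy values; the tolerances `rtol=1e-9` and `atol=1e-6` are assumptions, not settings taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the last one exposes floating point rounding
toy = pd.DataFrame({
    "Price Per Unit": [15.5, 14.0, 0.1],
    "Quantity": [10.0, 3.0, 3.0],
    "Total Spent": [155.0, 42.0, 0.3],
})

recomputed = toy["Price Per Unit"] * toy["Quantity"]

# Exact equality is too strict for floats: 0.1 * 3 is not exactly 0.3
exact = (recomputed == toy["Total Spent"]).to_numpy()

# Tolerance-based comparison treats rounding noise as consistent
consistent = np.isclose(recomputed, toy["Total Spent"], rtol=1e-9, atol=1e-6)
print(f"Exactly equal: {exact.sum()} of {len(toy)}")
print(f"Consistent within tolerance: {consistent.sum()} of {len(toy)}")
```

A fixed threshold of 0.01, as used in the cell below, is a reasonable choice for currency values; a relative tolerance mainly matters when totals span several orders of magnitude.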


In [49]:
# Verify Price Per Unit is now complete
print('Missing value check after reconstruction:')
print('=' * 80)
print(f'{PRICE_PER_UNIT} missing: {df[PRICE_PER_UNIT].isna().sum()}')
print(f'{TOTAL_SPENT} missing: {df[TOTAL_SPENT].isna().sum()}')
print(f'{QUANTITY} missing: {df[QUANTITY].isna().sum()}')
print(f'\n✓ {PRICE_PER_UNIT} is now 100% complete: {df[PRICE_PER_UNIT].isna().sum() == 0}')

# Mathematical consistency check
# Verify that Total Spent = Price Per Unit × Quantity for ALL rows
print('\n' + '=' * 80)
print('Mathematical Consistency Validation:')
print('=' * 80)

# Create filter for rows with all three fields present
# .notna() returns True where values are NOT missing
# .all(axis=1) checks if all three conditions are True for each row
completeRows = df[[PRICE_PER_UNIT, QUANTITY, TOTAL_SPENT]].notna().all(axis=1)
dataComplete = df[completeRows].copy()

print(f'Rows with complete numeric fields: {len(dataComplete)} out of {len(df)}')

# Calculate expected Total Spent using reconstructed prices
# element-wise multiplication
dataComplete['Calculated Total'] = dataComplete[PRICE_PER_UNIT] * dataComplete[QUANTITY]

# Calculate absolute difference between actual and calculated
dataComplete['Difference'] = (dataComplete[TOTAL_SPENT] - dataComplete['Calculated Total']).abs()

# Count rows with significant differences (> 0.01 to account for floating point precision)
inconsistent = (dataComplete['Difference'] > 0.01).sum()

mathConsistencyRate = ((len(dataComplete) - inconsistent) / len(dataComplete) * 100) if len(dataComplete) > 0 else 0.0
print(f'Rows with mathematical inconsistency (diff > 0.01): {inconsistent}')
print(f'Mathematical consistency rate: {mathConsistencyRate:.2f}%')

if inconsistent == 0:
    print('\n✓ Perfect accuracy: All rows satisfy Total Spent = Price Per Unit × Quantity')
    print('✓ Zero estimation error achieved through deterministic reconstruction')
else:
    print(f'\n⚠ Warning: {inconsistent} rows have inconsistent calculations')
    print('Sample of inconsistent rows:')
    print(dataComplete[dataComplete['Difference'] > 0.01][[PRICE_PER_UNIT, QUANTITY, TOTAL_SPENT, 'Calculated Total', 'Difference']].head())

# missingPrice still holds the pre-reconstruction mask, so downstream cells can reuse it
# to select the rows whose prices were reconstructed


Missing value check after reconstruction:
Price Per Unit missing: 0
Total Spent missing: 0
Quantity missing: 0

✓ Price Per Unit is now 100% complete: True

Mathematical Consistency Validation:
Rows with complete numeric fields: 11971 out of 11971
Rows with mathematical inconsistency (diff > 0.01): 0
Mathematical consistency rate: 100.00%

✓ Perfect accuracy: All rows satisfy Total Spent = Price Per Unit × Quantity
✓ Zero estimation error achieved through deterministic reconstruction


### Sample inspection: Before and after reconstruction

Display sample rows that were reconstructed to verify the calculation worked correctly.


In [50]:
# Display sample of reconstructed rows
print('Sample of rows after Price Per Unit reconstruction:')
print('=' * 80)
# missingPrice is still the original boolean filter (before reconstruction)
# Use it to show the same rows, now with reconstructed prices
print(df[missingPrice][[TRANSACTION_ID, CATEGORY, ITEM, PRICE_PER_UNIT, QUANTITY, TOTAL_SPENT]].head(10))

print('\nVerification:')
# For each row shown above, manually verify the calculation
# Select a few rows and show that Price × Quantity = Total
sampleData = df[missingPrice].head(5)
for idx, row in sampleData.iterrows():
    # row[PRICE_PER_UNIT] is the reconstructed price
    # row[QUANTITY] is the given quantity
    # row[TOTAL_SPENT] is the given total
    calculated = row[PRICE_PER_UNIT] * row[QUANTITY]
    actual = row[TOTAL_SPENT]
    # Compare calculated vs actual within floating point tolerance and mark the result accordingly
    mark = '✓' if np.isclose(calculated, actual) else '✗'
    print(f"  Row {idx}: {row[PRICE_PER_UNIT]:.2f} × {row[QUANTITY]:.0f} = {calculated:.2f} (actual: {actual:.2f}) {mark}")


Sample of rows after Price Per Unit reconstruction:
    Transaction ID                            Category Item  Price Per Unit  \
12     TXN_1007496                            Butchers  NaN            15.5   
50     TXN_1032287                                Food  NaN            21.5   
68     TXN_1044590       Electric household essentials  NaN            14.0   
70     TXN_1046262                       Milk Products  NaN            14.0   
71     TXN_1046367  Computers and electric accessories  NaN            18.5   
76     TXN_1051223                          Patisserie  NaN             5.0   
87     TXN_1058643                                Food  NaN             9.5   
104    TXN_1071762                           Beverages  NaN             9.5   
134    TXN_1095879                           Beverages  NaN             6.5   
136    TXN_1096977                                Food  NaN            23.0   

     Quantity  Total Spent  
12       10.0        155.0  
50        2.0       

### Impact on remaining missing values

Analyze the current state of missing data after Price Per Unit reconstruction. Among the fields handled so far, only `Item` should still have missing values (609 rows); `Discount Applied` also has gaps, which are deferred to STEP 4.
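
Because some columns are intentionally left incomplete until later steps, this check can be phrased as "no gaps outside an allowed set". The sketch below shows one way to do that; the helper `unexpected_missing`, the toy frame, and the `allowed` set are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def unexpected_missing(frame: pd.DataFrame, allowed: set) -> dict:
    """Return {column: missing_count} for columns with gaps that are not in `allowed`."""
    counts = frame.isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0 and col not in allowed}

# Hypothetical toy frame with gaps only in columns deferred to later steps
toy = pd.DataFrame({
    "Item": ["Item_A", np.nan],
    "Price Per Unit": [5.0, 6.5],
    "Discount Applied": [np.nan, True],
})

# Columns whose missing values are handled in STEP 3 and STEP 4
allowed = {"Item", "Discount Applied"}
leftover = unexpected_missing(toy, allowed)
print(leftover)
```

An empty result means every remaining gap is one that a later step is expected to handle; anything else would point to a column that slipped through.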


In [51]:
print('Current missing value status across all columns:')
print('=' * 80)

# Check all columns for missing values
# .isnull().sum() counts missing values for each column
missingSummary = df.isnull().sum()
# Filter to show only columns with missing values
missingCols = missingSummary[missingSummary > 0]

if len(missingCols) > 0:
    print('Columns with missing values:')
    for col, count in missingCols.items():
        # Calculate percentage of missing values
        pct = (count / len(df)) * 100
        print(f'  {col:30s}: {count:5d} ({pct:5.2f}%)')
else:
    print('✓ No missing values in any column')

print('\n' + '=' * 80)
print('Summary of STEP 2 completion:')
print('=' * 80)
print(f'✓ Price Per Unit: 100% complete (was 5.09% missing, reconstructed 609 values)')
print(f'✓ Total Spent: 100% complete (from STEP 1)')
print(f'✓ Quantity: 100% complete (from STEP 1)')
print(f'  Item: {df[ITEM].isna().sum()} missing ({(df[ITEM].isna().sum()/len(df)*100):.2f}%) - to be handled in STEP 3')


Current missing value status across all columns:
Columns with missing values:
  Item                          :   609 ( 5.09%)
  Discount Applied              :  3988 (33.31%)

Summary of STEP 2 completion:
✓ Price Per Unit: 100% complete (was 5.09% missing, reconstructed 609 values)
✓ Total Spent: 100% complete (from STEP 1)
✓ Quantity: 100% complete (from STEP 1)
  Item: 609 missing (5.09%) - to be handled in STEP 3


### Persist results

Save the dataset with Price Per Unit reconstructed. This becomes the input for STEP 3 (Item imputation).
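
Since the saved file feeds the next step, a quick round-trip check can confirm that what was written is what will be read back. The sketch below uses a temporary directory and toy rows; the file name `roundtrip.csv` and the transaction and item labels are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the reconstructed dataset
toy = pd.DataFrame({
    "Transaction ID": ["TXN_A", "TXN_B"],
    "Item": ["Item_A", np.nan],
    "Price Per Unit": [15.5, 6.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")
    toy.to_csv(path, index=False)  # index=False keeps row numbers out of the file
    reloaded = pd.read_csv(path)

# Same shape and columns, and no price gaps introduced by the round trip
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
price_gaps = int(reloaded["Price Per Unit"].isna().sum())
print(same_shape, same_columns, price_gaps)
```

Missing `Item` values are written as empty fields and come back as NaN, so the gaps left for STEP 3 survive the save.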


In [52]:
# Save dataset with Price Per Unit reconstructed to CSV
# to_csv writes the DataFrame to a CSV file
# index=False prevents writing row numbers as a column
# This creates the output file that will be used in STEP 3 (Item imputation)
df.to_csv(CSV_OUT, index=False)
print(f'✓ Dataset with Price Per Unit reconstructed saved to {CSV_OUT}')
print(f'  Final row count: {len(df)}')
print(f'  Price Per Unit: 100% complete (609 values reconstructed)')
print(f'  Ready for next step: Item imputation (STEP 3)')


✓ Dataset with Price Per Unit reconstructed saved to ../../output_data/2_price_per_unit/price_per_unit_reconstructed.csv
  Final row count: 11971
  Price Per Unit: 100% complete (609 values reconstructed)
  Ready for next step: Item imputation (STEP 3)


### Summary

**Price Per Unit Handling - STEP 2 Complete**

**Classification:** MAR (Missing At Random)
- Missingness depends on Item field (observable variable)
- Perfect co-missingness with Item (100% overlap)
- Missing rates vary by category (4.26%-5.82%)

**Method:** Deterministic reconstruction using formula
- Formula: `Price Per Unit = Total Spent ÷ Quantity`
- Reconstructed 609 values (100% of missing prices)
- Zero estimation error

**Justification:**
- Mathematical relationship exists and is exact
- Both Total Spent and Quantity are 100% complete (from STEP 1)
- No statistical imputation needed when exact calculation is possible
- Maintains perfect mathematical consistency

**Validation results:**
- ✓ All 609 missing prices successfully reconstructed
- ✓ 100% mathematical consistency: Total = Price × Quantity
- ✓ Zero reconstruction errors
- ✓ Price Per Unit is now 100% complete

**Next steps:**
1. STEP 3: Impute Item using mode by Category (609 missing values)
2. STEP 4: Handle Discount Applied as "Unknown" category
