In [3]:
import pandas as pd
import numpy as np


In [4]:
data = pd.read_csv('../Deliverable1Dataset.csv')


### Determine missingness of Total Spent

Before checking, convert `Price Per Unit`, `Quantity`, and `Total Spent` from strings to a numeric type. The three columns are related (`Total Spent = Price Per Unit × Quantity`), so they must be numeric before they can be compared. Then count how many observations are missing `Total Spent`.


In [5]:
# Convert numeric columns to proper numeric type
# pd.to_numeric converts string values to numeric, handling errors by coercing invalid values to NaN
# This ensures mathematical operations work correctly
for col in ['Price Per Unit', 'Quantity', 'Total Spent']:
    # data[col] accesses each column
    # errors='coerce' converts non-numeric values to NaN instead of raising an error
    data[col] = pd.to_numeric(data[col], errors='coerce')

# Count missing Total Spent values
# data['Total Spent'].isna() creates a boolean Series where True indicates missing values
# .sum() counts the number of True values (i.e., missing entries)
missing_total = data['Total Spent'].isna()

print(f'Missing Total Spent rows: {missing_total.sum()} of {len(data)} ({missing_total.mean():.2%})')


Missing Total Spent rows: 604 of 12575 (4.80%)
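
Because `errors='coerce'` silently turns invalid strings into `NaN`, it can help to separate values that were already absent from values that coercion created. A minimal sketch on a made-up Series (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical raw column: two valid numbers, one invalid string, one truly missing value
raw = pd.Series(['1.5', 'abc', None, '3'])

# Convert to numeric; invalid strings become NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')

# Present in the raw data but NaN after conversion -> created by coercion
coerced_mask = raw.notna() & converted.isna()

print(f'Missing after conversion: {converted.isna().sum()}')
print(f'Of which created by coercion: {coerced_mask.sum()}')
```

If the second count is non-zero on real data, the offending raw strings are worth inspecting before treating them as ordinary missing values.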


### Missingness mechanism

Quantify how often `Total Spent` is missing within each `Category`, `Payment Method`, and `Location` to see whether its missingness depends on these variables. The result indicates which mechanism applies, Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR), and therefore which handling approach is appropriate.


In [6]:
# Analyze missingness patterns across categories
# .assign creates a new column 'missing_total' with the boolean missing indicator
# .groupby('Category') groups all rows by their category value
# ['missing_total'].mean() calculates the proportion of missing values per category
# .sort_values(ascending=False) sorts categories by missing proportion (highest first)
summary = data.assign(missing_total=missing_total).groupby('Category')['missing_total'].mean().sort_values(ascending=False)
print('Share of Total Spent missing by Category:')
print(summary)

# Analyze missingness patterns across payment methods
# Same logic as above, but grouped by 'Payment Method'
payment_share = data.assign(missing_total=missing_total).groupby('Payment Method')['missing_total'].mean().sort_values(ascending=False)
print('\nShare of Total Spent missing by Payment Method:')
print(payment_share)

# Analyze missingness patterns across locations
# Same logic as above, but grouped by 'Location'
location_share = data.assign(missing_total=missing_total).groupby('Location')['missing_total'].mean().sort_values(ascending=False)
print('\nShare of Total Spent missing by Location:')
print(location_share)


Share of Total Spent missing by Category:
Category
Patisserie                            0.056937
Computers and electric accessories    0.051990
Food                                  0.051008
Electric household essentials         0.047140
Butchers                              0.045918
Beverages                             0.045310
Milk Products                         0.044823
Furniture                             0.041483
Name: missing_total, dtype: float64

Share of Total Spent missing by Payment Method:
Payment Method
Digital Wallet    0.048986
Cash              0.048028
Credit Card       0.047076
Name: missing_total, dtype: float64

Share of Total Spent missing by Location:
Location
In-store    0.051117
Online      0.045011
Name: missing_total, dtype: float64
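
The three group-wise summaries share the same logic, so they could be produced by one small helper. A sketch, assuming a hypothetical function name `missing_share_by` and toy data that only mimics the real column names:

```python
import numpy as np
import pandas as pd

def missing_share_by(df, target, group_col):
    """Share of rows with missing `target` within each level of `group_col`."""
    return (
        df[target].isna()          # boolean Series: True where target is missing
        .groupby(df[group_col])    # group the flags by the grouping column
        .mean()                    # mean of booleans = proportion missing
        .sort_values(ascending=False)
    )

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    'Category': ['Food', 'Food', 'Furniture', 'Furniture'],
    'Location': ['Online', 'In-store', 'Online', 'In-store'],
    'Total Spent': [10.0, np.nan, 5.0, 7.5],
})

shares = {col: missing_share_by(toy, 'Total Spent', col) for col in ['Category', 'Location']}
for col, share in shares.items():
    print(f'\nShare of Total Spent missing by {col}:')
    print(share)
```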


### Co-missingness analysis

Examine whether `Total Spent` is missing together with other key fields. The overlap pattern informs both the classification of the missingness and the handling approach.


In [7]:
# Check co-missingness with Quantity
# missing_total & data['Quantity'].isna() creates boolean Series that is True only when BOTH are missing
# .sum() counts how many rows have both fields missing
# This reveals if Total Spent and Quantity are missing together (perfect overlap = same count)
qty_overlap = (missing_total & data['Quantity'].isna()).sum()
print(f'Rows with both Total Spent and Quantity missing: {qty_overlap}')
print(f'Total Spent missing: {missing_total.sum()}')
print(f'Perfect overlap: {qty_overlap == missing_total.sum()}')

# Check co-missingness with Price Per Unit
# Similar logic: count rows where both Total Spent and Price Per Unit are missing
price_overlap = (missing_total & data['Price Per Unit'].isna()).sum()
print(f'\nRows with both Total Spent and Price Per Unit missing: {price_overlap}')

# Check co-missingness with Item
# Count rows where both Total Spent and Item are missing
item_overlap = (missing_total & data['Item'].isna()).sum()
print(f'Rows with both Total Spent and Item missing: {item_overlap}')
print(f'Percentage of Total Spent missing cases with Item also missing: {item_overlap / missing_total.sum():.1%}')


Rows with both Total Spent and Quantity missing: 604
Total Spent missing: 604
Perfect overlap: True

Rows with both Total Spent and Price Per Unit missing: 0
Rows with both Total Spent and Item missing: 604
Percentage of Total Spent missing cases with Item also missing: 100.0%
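
The pairwise checks above can also be read off a single co-missingness matrix. A minimal sketch on toy data (illustrative values only): the diagonal holds each column's missing count and the off-diagonal cells hold the number of rows where both columns are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using the same column names as the dataset
toy = pd.DataFrame({
    'Item': ['a', None, None, 'b'],
    'Quantity': [1.0, np.nan, 2.0, 3.0],
    'Total Spent': [5.0, np.nan, 10.0, 6.0],
})

# 1 where a value is missing, 0 otherwise
na = toy.isna().astype(int)

# (columns x rows) dot (rows x columns) -> columns x columns joint-missing counts
co_missing = na.T.dot(na)
print(co_missing)
```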


### Reconstructability assessment

Since `Total Spent = Price Per Unit × Quantity`, we assess how many missing `Total Spent` values could theoretically be reconstructed from the other two fields. This determines whether imputation is feasible or deletion is necessary.


In [8]:
# Check if Total Spent can be reconstructed from Price Per Unit and Quantity
# For reconstruction, we need Total Spent to be missing BUT both Price and Quantity to be present
# missing_total ensures Total Spent is missing
# data['Price Per Unit'].notna() ensures Price Per Unit is NOT missing
# data['Quantity'].notna() ensures Quantity is NOT missing
# & combines all three conditions (all must be True)
reconstructable = missing_total & data['Price Per Unit'].notna() & data['Quantity'].notna()
reconstructable_count = reconstructable.sum()

print(f'Missing Total Spent that CAN be reconstructed: {reconstructable_count} out of {missing_total.sum()}')
print(f'Reconstruction rate: {reconstructable_count / missing_total.sum():.1%}')

# Check irrecoverable cases (missing Total Spent AND at least one other field)
# ~ negates reconstructable, giving us rows where reconstruction is NOT possible
# These are rows where Total Spent is missing AND at least one of (Price, Quantity) is also missing
irrecoverable = missing_total & ~reconstructable
irrecoverable_count = irrecoverable.sum()

print(f'\nMissing Total Spent that CANNOT be reconstructed: {irrecoverable_count} out of {missing_total.sum()}')
print(f'Irrecoverable rate: {irrecoverable_count / missing_total.sum():.1%}')


Missing Total Spent that CAN be reconstructed: 0 out of 604
Reconstruction rate: 0.0%

Missing Total Spent that CANNOT be reconstructed: 604 out of 604
Irrecoverable rate: 100.0%


### Missing data classification

**Classification: MAR (Missing At Random)**

**Rationale:**
- Perfect co-missingness with `Quantity` (604 cases = 100% overlap)
- Missing rates vary by category (4.15%-5.69%), a modest spread that suggests some dependence on observable characteristics
- Perfectly associated with `Item` missingness (100% of cases also have `Item` missing)
- The missingness is systematic and tied to whether `Item` was recorded (an observable variable)
- Not MCAR because missing rates are not uniform across categories and every missing case coincides with a missing `Item`
- Not MNAR because the missingness is explained by observable variables (`Item` field status) rather than by the unobserved `Total Spent` values themselves

**Key finding:** When `Item` was not recorded during data collection, both `Quantity` and `Total Spent` were also systematically omitted, suggesting a data entry workflow issue rather than values being hidden due to their magnitude.
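
The claim that missing rates differ across categories could be checked more formally with a chi-square test of independence between the grouping variable and the missingness flag. The sketch below computes only the test statistic and degrees of freedom by hand on toy data; the function name `chi_square_stat` is hypothetical, and the p-value lookup against a chi-square distribution is left out.

```python
import numpy as np
import pandas as pd

def chi_square_stat(groups, flags):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    observed = pd.crosstab(groups, flags).to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (r, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, c)
    # Expected counts under independence: row total * column total / grand total
    expected = row_totals @ col_totals / observed.sum()
    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return stat, dof

# Toy case 1: both groups have a 20% missing rate -> statistic is 0
groups_a = pd.Series(['A'] * 10 + ['B'] * 20, name='Category')
flags_a = pd.Series([True] * 2 + [False] * 8 + [True] * 4 + [False] * 16, name='missing_total')
stat_a, dof_a = chi_square_stat(groups_a, flags_a)

# Toy case 2: missingness is fully determined by the group -> large statistic
groups_b = pd.Series(['A'] * 10 + ['B'] * 10, name='Category')
flags_b = pd.Series([True] * 10 + [False] * 10, name='missing_total')
stat_b, dof_b = chi_square_stat(groups_b, flags_b)

print(f'Equal rates:     chi2 = {stat_a:.2f}, dof = {dof_a}')
print(f'Dependent rates: chi2 = {stat_b:.2f}, dof = {dof_b}')
```

A statistic near zero is consistent with equal missing rates across groups, while a large value relative to the chi-square distribution with `dof` degrees of freedom points to dependence.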


### Handling strategy: Listwise deletion

**Justification for deletion (not imputation):**

1. **Critical target variable:** `Total Spent` is essential for transaction analysis and should not be estimated
2. **Perfect co-missingness:** These 604 rows also have missing `Quantity`, making reconstruction impossible
3. **Cannot reliably impute:** Each of these rows is missing `Quantity` and `Item` as well, leaving too little information for a trustworthy estimate
4. **Small data loss:** Only 4.8% of the dataset vs. large gain in data integrity
5. **Side benefit:** Eliminates 604 problematic cases that would otherwise require imputing several fields at once

**Alternative considered:** Reconstruct using Price × Quantity
- Not feasible: 0% of missing Total Spent cases have both Price and Quantity present
- Would require imputing Quantity first, introducing estimation error into a critical field


In [9]:
# Display sample of rows to be dropped
print('Sample of rows with missing Total Spent (to be deleted):')
print('=' * 80)
# data[missing_total] filters to show only rows where Total Spent is missing
# .head(10) shows the first 10 such rows
# This allows visual inspection of what data will be lost
print(data[missing_total][['Transaction ID', 'Category', 'Item', 'Price Per Unit', 'Quantity', 'Total Spent']].head(10))

print('\nObservations about rows to be deleted:')
print('- All have missing Quantity (perfect overlap)')
print('- All have missing Item (100% overlap)')
print('- Cannot reconstruct Total Spent without Quantity')
print('- Represent systematic data collection gaps, not missingness that is completely at random')


Sample of rows with missing Total Spent (to be deleted):
    Transaction ID                            Category Item  Price Per Unit  \
6      TXN_1005543                                Food  NaN            30.5   
64     TXN_1041483       Electric household essentials  NaN            15.5   
65     TXN_1041890                           Furniture  NaN            27.5   
104    TXN_1069238                                Food  NaN             5.0   
180    TXN_1130015                       Milk Products  NaN             9.5   
216    TXN_1153995       Electric household essentials  NaN            23.0   
217    TXN_1154680                           Furniture  NaN            35.0   
225    TXN_1158381                            Butchers  NaN            36.5   
249    TXN_1175914                            Butchers  NaN            23.0   
262    TXN_1187836  Computers and electric accessories  NaN            38.0   

     Quantity  Total Spent  
6         NaN          NaN  
64        NaN  

In [10]:
# Count rows before deletion
# len(data) returns the total number of rows
rows_before = len(data)
print(f'Rows before deletion: {rows_before}')
print(f'Rows to be deleted: {missing_total.sum()}')

# Perform listwise deletion
# .dropna removes rows with NaN values
# subset=['Total Spent'] specifies to only check the Total Spent column
# This removes any row where Total Spent is missing
data_cleaned = data.dropna(subset=['Total Spent'])

# Count rows after deletion
rows_after = len(data_cleaned)
# Calculate retention rate as percentage
retention_rate = (rows_after / rows_before) * 100

print(f'\nRows after deletion: {rows_after}')
print(f'Rows deleted: {rows_before - rows_after}')
print(f'Data retention rate: {retention_rate:.2f}%')


Rows before deletion: 12575
Rows to be deleted: 604

Rows after deletion: 11971
Rows deleted: 604
Data retention rate: 95.20%
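
A quick sanity check can confirm that `dropna(subset=...)` removed exactly the rows flagged by the missingness mask. A sketch on toy data (the transaction IDs are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one of three rows has a missing Total Spent
toy = pd.DataFrame({
    'Transaction ID': ['TXN_1', 'TXN_2', 'TXN_3'],
    'Total Spent': [10.0, np.nan, 7.5],
})

missing = toy['Total Spent'].isna()

# Two equivalent ways to delete the rows
via_dropna = toy.dropna(subset=['Total Spent'])
via_mask = toy[~missing]

# Both routes keep the same rows, and the row count drops by the number flagged
assert via_dropna.equals(via_mask)
assert len(via_dropna) == len(toy) - missing.sum()
print(f'Rows kept: {len(via_dropna)} of {len(toy)}')
```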


### Validation: Side benefits of deletion

Verify that deleting rows with missing `Total Spent` also eliminates other missing value problems, particularly with `Quantity`.


In [11]:
# Check missing values in the key columns after deletion
print('Missing value counts after Total Spent deletion:')
print('=' * 80)

# Iterate through the key columns (three numeric fields plus Item)
for col in ['Total Spent', 'Quantity', 'Price Per Unit', 'Item']:
    # data_cleaned[col].isna().sum() counts missing values in each column
    missing_count = data_cleaned[col].isna().sum()
    # Calculate percentage of missing values
    missing_pct = (missing_count / len(data_cleaned)) * 100
    # f-string formatting: {col:20s} pads column name to 20 characters for alignment
    print(f'{col:20s}: {missing_count:5d} ({missing_pct:5.2f}%)')

# Verify Total Spent is now complete
print(f'\n✓ Total Spent is now 100% complete: {data_cleaned["Total Spent"].isna().sum() == 0}')
# Verify Quantity is now complete (due to perfect co-missingness)
print(f'✓ Quantity is now 100% complete: {data_cleaned["Quantity"].isna().sum() == 0}')


Missing value counts after Total Spent deletion:
Total Spent         :     0 ( 0.00%)
Quantity            :     0 ( 0.00%)
Price Per Unit      :   609 ( 5.09%)
Item                :   609 ( 5.09%)

✓ Total Spent is now 100% complete: True
✓ Quantity is now 100% complete: True
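
The same per-column report can be built without a loop by combining counts and shares in one table. A minimal sketch on toy data (illustrative values; the column name `share_pct` is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Price Per Unit and one missing Item
toy = pd.DataFrame({
    'Total Spent': [10.0, 7.5, 6.0, 4.0],
    'Quantity': [2.0, 3.0, 1.0, 4.0],
    'Price Per Unit': [5.0, np.nan, 6.0, 1.0],
    'Item': ['a', None, 'b', 'c'],
})

na = toy.isna()

# One row per column: number of missing values and their share in percent
report = pd.DataFrame({'missing': na.sum(), 'share_pct': na.mean() * 100})
print(report)
```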


### Mathematical consistency check

Verify that for all complete rows, the relationship `Total Spent = Price Per Unit × Quantity` holds true. This validates data quality and ensures no mathematical inconsistencies exist.


In [12]:
# Create filter for rows where all three numeric fields are present
# .notna() returns True where values are NOT missing
# .all(axis=1) checks if all three conditions are True for each row
complete_rows = data_cleaned[['Price Per Unit', 'Quantity', 'Total Spent']].notna().all(axis=1)
# Filter to only complete rows
complete_data = data_cleaned[complete_rows].copy()

print(f'Rows with complete Price, Quantity, and Total Spent: {len(complete_data)}')

# Calculate expected Total Spent using the formula
# complete_data['Price Per Unit'] * complete_data['Quantity'] performs element-wise multiplication
complete_data['Calculated_Total'] = complete_data['Price Per Unit'] * complete_data['Quantity']

# Calculate absolute difference between actual and calculated
# abs() returns absolute value (always positive)
complete_data['Difference'] = abs(complete_data['Total Spent'] - complete_data['Calculated_Total'])

# Count rows with significant differences (> 0.01 to account for floating point precision)
# complete_data['Difference'] > 0.01 creates boolean Series
# .sum() counts True values
inconsistent = (complete_data['Difference'] > 0.01).sum()

print(f'Rows with mathematical inconsistency (diff > 0.01): {inconsistent}')
print(f'Mathematical consistency rate: {((len(complete_data) - inconsistent) / len(complete_data) * 100):.2f}%')

if inconsistent == 0:
    print('\n✓ All rows satisfy the formula: Total Spent = Price Per Unit × Quantity')
else:
    print(f'\n⚠ Warning: {inconsistent} rows have inconsistent calculations')
    # Show sample of inconsistent rows for investigation
    print('\nSample of inconsistent rows:')
    print(complete_data[complete_data['Difference'] > 0.01][['Price Per Unit', 'Quantity', 'Total Spent', 'Calculated_Total', 'Difference']].head())


Rows with complete Price, Quantity, and Total Spent: 11362
Rows with mathematical inconsistency (diff > 0.01): 0
Mathematical consistency rate: 100.00%

✓ All rows satisfy the formula: Total Spent = Price Per Unit × Quantity
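
The tolerance comparison can also be written with `np.isclose`, which makes the absolute tolerance explicit. A sketch on toy data (values are made up); `rtol=0` is set so that only the 0.01 absolute tolerance applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is consistent, row 1 is off by 0.5
toy = pd.DataFrame({
    'Price Per Unit': [2.5, 3.0],
    'Quantity': [4.0, 2.0],
    'Total Spent': [10.0, 6.5],
})

expected_total = toy['Price Per Unit'] * toy['Quantity']

# True where |Total Spent - expected| <= 0.01
consistent = np.isclose(toy['Total Spent'], expected_total, rtol=0, atol=0.01)
n_inconsistent = int((~consistent).sum())

print(f'Inconsistent rows: {n_inconsistent}')
```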


### Impact on remaining missing values

Analyze how deleting the rows with missing `Total Spent` affects the remaining missing values, particularly for `Item` and `Price Per Unit`, which will be handled in subsequent steps.


In [13]:
print('Impact on Item missingness:')
print('=' * 80)
# Count missing Item values before and after deletion
item_missing_before = data['Item'].isna().sum()
item_missing_after = data_cleaned['Item'].isna().sum()
item_removed = item_missing_before - item_missing_after

print(f'Item missing before: {item_missing_before}')
print(f'Item missing after: {item_missing_after}')
print(f'Item missing rows removed: {item_removed} ({item_removed / item_missing_before * 100:.1f}% of Item missing cases)')
print(f'Item missing rows remaining: {item_missing_after} ({item_missing_after / item_missing_before * 100:.1f}%)')

print('\nImpact on Price Per Unit missingness:')
print('=' * 80)
# Count missing Price Per Unit values before and after deletion
price_missing_before = data['Price Per Unit'].isna().sum()
price_missing_after = data_cleaned['Price Per Unit'].isna().sum()
price_removed = price_missing_before - price_missing_after

print(f'Price Per Unit missing before: {price_missing_before}')
print(f'Price Per Unit missing after: {price_missing_after}')
print(f'Price Per Unit missing rows removed: {price_removed} ({price_removed / price_missing_before * 100:.1f}%)')
print(f'Price Per Unit missing rows remaining: {price_missing_after} ({price_missing_after / price_missing_before * 100:.1f}%)')

print('\n**Key insight:** Remaining missing values can now be imputed using available data:')
print(f'  - {price_missing_after} Price Per Unit values can be reconstructed using Total ÷ Quantity')
print(f'  - {item_missing_after} Item values can be imputed using Category information')


Impact on Item missingness:
Item missing before: 1213
Item missing after: 609
Item missing rows removed: 604 (49.8% of Item missing cases)
Item missing rows remaining: 609 (50.2%)

Impact on Price Per Unit missingness:
Price Per Unit missing before: 609
Price Per Unit missing after: 609
Price Per Unit missing rows removed: 0 (0.0%)
Price Per Unit missing rows remaining: 609 (100.0%)

**Key insight:** Remaining missing values can now be imputed using available data:
  - 609 Price Per Unit values can be reconstructed using Total ÷ Quantity
  - 609 Item values can be imputed using Category information
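
The Total ÷ Quantity reconstruction mentioned above might look like the following. This is only a sketch of the idea on toy data, not the notebook's actual next step; it also assumes that a zero `Quantity` should leave the price missing rather than produce an infinite value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 0 and 2 lack a price, row 2 has a zero quantity
toy = pd.DataFrame({
    'Price Per Unit': [np.nan, 4.0, np.nan],
    'Quantity': [2.0, 3.0, 0.0],
    'Total Spent': [10.0, 12.0, 0.0],
})

# Replace zero quantities with NaN so the division cannot produce inf
safe_qty = toy['Quantity'].where(toy['Quantity'] != 0)
derived_price = toy['Total Spent'] / safe_qty

# Fill only the missing prices; recorded prices are left untouched
price_filled = toy['Price Per Unit'].fillna(derived_price)
print(price_filled)
```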


### Persist results

Save the cleaned dataset with Total Spent missing rows removed. This becomes the input for subsequent imputation steps (Price Per Unit, then Item).


In [16]:
# Save cleaned dataset to CSV
# to_csv writes the DataFrame to a CSV file
# index=False prevents writing row numbers as a column
# This creates the output file that will be used in the next step (Price Per Unit imputation)
output_path = '../output_data/1_total_spent/total_spent_cleaned.csv'
data_cleaned.to_csv(output_path, index=False)
print(f'✓ Dataset with Total Spent missing rows removed saved to {output_path}')
print(f'  Final row count: {len(data_cleaned)}')
print('  Ready for next step: Price Per Unit imputation')


✓ Dataset with Total Spent missing rows removed saved to ../output_data/1_total_spent/total_spent_cleaned.csv
  Final row count: 11971
  Ready for next step: Price Per Unit imputation
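
`to_csv` does not create missing folders, so the save step fails if the output directory does not exist yet. A minimal sketch of creating it first, using a temporary directory and a made-up file name so that nothing is written to the real project tree:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row frame standing in for the cleaned dataset
toy = pd.DataFrame({'Transaction ID': ['TXN_1'], 'Total Spent': [10.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'output_data' / 'cleaned.csv'
    # Create the parent folder (and any missing ancestors) before writing
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    # Read the file back to confirm the round trip
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```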


### Summary

**Total Spent Handling - STEP 1 Complete**

**Classification:** MAR (Missing At Random)
- Missingness depends on Item field (observable variable)
- Perfect co-missingness with Quantity (100% overlap)

**Method:** Listwise deletion
- Removed 604 rows (4.8% of dataset)
- Retained 95.2% of data

**Justification:**
- Total Spent is a critical target variable that should not be estimated
- Cannot reconstruct: all missing cases also lack Quantity
- Small data loss with large gain in data integrity

**Side benefits:**
- ✓ Quantity is now 100% complete (perfect co-missingness eliminated)
- ✓ Reduced Item missing from 1,213 to 609 cases
- ✓ Remaining missing `Price Per Unit` and `Item` values are now reconstructable or imputable

**Next steps:**
1. STEP 2: Impute Price Per Unit using formula (Total ÷ Quantity)
2. STEP 3: Impute Item using mode by Category
3. STEP 4: Handle Discount Applied as "Unknown" category
