# Finding missing-data patterns
## Data overview

In [2]:
import pandas as pd

from visualize_missing_data import columns_with_missing

In [14]:
df = pd.read_csv("../Deliverable1Dataset.csv")
data = df.copy()

In [15]:
print(f'total rows: {data.shape[0]}')
print(f'total columns: {data.shape[1]}')

missing_counts = data.isnull().sum()
missing_percentage = (missing_counts / len(data)) * 100

report = pd.DataFrame({
    "Column": data.columns,
    "Missing Count": missing_counts.values,
    "Missing %": missing_percentage.round(2).astype(str) + "%"
})

report = report[report["Missing Count"] > 0]

print(report.to_string(index=False))


total rows: 12575
total columns: 11
          Column  Missing Count Missing %
            Item           1213     9.65%
  Price Per Unit            609     4.84%
        Quantity            604      4.8%
     Total Spent            604      4.8%
Discount Applied           4199    33.39%


## Classification for missing data
### 1. **Total Spent**

In [17]:
missing_total = data['Total Spent'].isna()

print(f'Missing Total Spent rows: {missing_total.sum()} of {len(data)} ({missing_total.mean():.2%})')


qty_overlap = (missing_total & data['Quantity'].isna()).sum()
print(f'Rows with both Total Spent and Quantity missing: {qty_overlap}')
print(f'Total Spent missing: {missing_total.sum()}')
print(f'Perfect overlap: {qty_overlap == missing_total.sum()}')


price_overlap = (missing_total & data['Price Per Unit'].isna()).sum()
print(f'\nRows with both Total Spent and Price Per Unit missing: {price_overlap}')


item_overlap = (missing_total & data['Item'].isna()).sum()
print(f'Rows with both Total Spent and Item missing: {item_overlap}')
print(f'Percentage of Total Spent missing cases with Item also missing: {item_overlap / missing_total.sum():.1%}')

Missing Total Spent rows: 604 of 12575 (4.80%)
Rows with both Total Spent and Quantity missing: 604
Total Spent missing: 604
Perfect overlap: True

Rows with both Total Spent and Price Per Unit missing: 0
Rows with both Total Spent and Item missing: 604
Percentage of Total Spent missing cases with Item also missing: 100.0%


In [19]:
summary = data.assign(missing_total=missing_total).groupby('Category')['missing_total'].mean().sort_values(ascending=False)
print('Percentage of Total Spent missing by Category:')
print(summary)

Percentage of Total Spent missing by Category:
Category
Patisserie                            0.056937
Computers and electric accessories    0.051990
Food                                  0.051008
Electric household essentials         0.047140
Butchers                              0.045918
Beverages                             0.045310
Milk Products                         0.044823
Furniture                             0.041483
Name: missing_total, dtype: float64


**Detection**
- Whenever Quantity is missing, Total Spent is missing too, and vice versa (604 records, a perfect overlap)
- All 604 of these records are also missing Item. The reverse does not hold: Item is missing in 1,213 records overall
- Total Spent and Price Per Unit are never missing in the same record (0 records)
- In other words, for some transactions where the Item was not captured, Quantity and Total Spent were not recorded either
- The share of missing Total Spent varies by category, from about 4.1% (Furniture) to about 5.7% (Patisserie)

**Conclusion**
- The missingness of Total Spent is associated with other fields: it coincides with missing Quantity and Item, and its rate varies by Category => Missing At Random (MAR)
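
The pairwise overlap checks above can also be gathered into a single co-missingness matrix. The following is a minimal sketch on a made-up toy frame (the values are illustrative only; just the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Item": ["A", None, None, "B", None],
    "Quantity": [1.0, np.nan, 2.0, 3.0, np.nan],
    "Total Spent": [10.0, np.nan, 8.0, 9.0, np.nan],
    "Price Per Unit": [10.0, 5.0, np.nan, 3.0, 2.0],
})

# 1 where a value is missing, 0 otherwise.
is_missing = toy.isna().astype(int)

# co_missing.loc[a, b] counts the rows where both a and b are missing;
# the diagonal holds the plain missing count of each column.
co_missing = is_missing.T.dot(is_missing)
print(co_missing)
```

A column whose off-diagonal count equals its own diagonal count (here Total Spent against Quantity) is always missing together with the other column.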


### 2. **Quantity**


In [21]:
missing_qty_total = data['Quantity'].isna()

summary_qty = data.assign(missing_qty_total=missing_qty_total).groupby('Category')['missing_qty_total'].mean().sort_values(ascending=False)
print('Percentage of Quantity missing by Category:')
print(summary_qty)

Percentage of Quantity missing by Category:
Category
Patisserie                            0.056937
Computers and electric accessories    0.051990
Food                                  0.051008
Electric household essentials         0.047140
Butchers                              0.045918
Beverages                             0.045310
Milk Products                         0.044823
Furniture                             0.041483
Name: missing_qty_total, dtype: float64


**Detection**
- Quantity has exactly the same missingness pattern as Total Spent: both are missing in the same 604 records, all of which are also missing Item
- The per-category missing rates for Quantity are identical to those for Total Spent

**Conclusion**
- Missing at Random (MAR) on Quantity

### 3. **Price per Unit**


In [24]:
missing_ppu = data['Price Per Unit'].isna()

print(f'Missing Price per Unit rows: {missing_ppu.sum()} of {len(data)} ({missing_ppu.mean():.2%})')

Missing Item rows: 1213 of 12575 (9.65%)
Missing Price per Unit rows: 609 of 12575 (4.84%)


**Detection**
- 609 records are missing Price Per Unit (4.84%)
- These records do not overlap with missing Total Spent (0 shared records), so Total Spent and Quantity are available for them
- Missing Price Per Unit is associated with Item: whenever Price Per Unit is missing, Item is missing too (609 of the 1,213 records with a missing Item)
- Price Per Unit can therefore be reconstructed with the formula "Total Spent / Quantity"

**Conclusion**
- Missing at Random (MAR) on the Price Per Unit
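
One way to confirm that a missing price is recoverable is to check that both inputs of the formula are present in the same row. This is a sketch on made-up toy rows, not the notebook's actual check:

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Item": ["A", None, None, "B"],
    "Price Per Unit": [5.0, np.nan, 4.0, np.nan],
    "Quantity": [2.0, 3.0, np.nan, 1.0],
    "Total Spent": [10.0, 12.0, np.nan, 7.0],
})

missing_ppu = toy["Price Per Unit"].isna()
# A missing price is recoverable only if both inputs of the formula exist.
has_inputs = toy[["Quantity", "Total Spent"]].notna().all(axis=1)
recoverable = missing_ppu & has_inputs
ppu_and_item = missing_ppu & toy["Item"].isna()

print(f"Missing Price Per Unit: {missing_ppu.sum()}")
print(f"Recoverable from Total Spent / Quantity: {recoverable.sum()}")
print(f"Missing price rows that also lack Item: {ppu_and_item.sum()}")
```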

### 4. **Item**

In [25]:
missing_item_total = data['Item'].isna()

summary_item = data.assign(missing_item_total=missing_item_total).groupby('Category')['missing_item_total'].mean().sort_values(ascending=False)
print('Percentage of Item missing by Category:')
print(summary_item)

Percentage of Item missing by Category:
Category
Patisserie                            0.104058
Computers and electric accessories    0.103338
Food                                  0.102015
Milk Products                         0.100379
Electric household essentials         0.096794
Butchers                              0.093750
Beverages                             0.089343
Furniture                             0.082338
Name: missing_item_total, dtype: float64


**Detection**
- Missing rates differ slightly between categories, from about 8.2% (Furniture) to about 10.4% (Patisserie)
- 1,213 records are missing Item
- Missing Item coincides with missing Total Spent and Quantity in 604 records and with missing Price Per Unit in 609 records (604 + 609 = 1,213)
- Because Category is always present, a plausible Item can be inferred from it

**Conclusion**
- Missing at Random (MAR) on the Item
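
The split of missing Item rows by which other fields are missing can be tabulated directly. A minimal sketch on made-up toy rows (the labels in `breakdown` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Item": [None, None, None, "A", None],
    "Price Per Unit": [np.nan, 4.0, np.nan, 2.0, 3.0],
    "Quantity": [1.0, np.nan, 2.0, 5.0, np.nan],
    "Total Spent": [3.0, np.nan, 6.0, 10.0, np.nan],
})

# Restrict to rows without an Item, then count what else is missing there.
item_missing = toy[toy["Item"].isna()]
qty_and_total = item_missing["Quantity"].isna() & item_missing["Total Spent"].isna()

breakdown = pd.Series({
    "Item missing": len(item_missing),
    "and Quantity + Total Spent missing": int(qty_and_total.sum()),
    "and Price Per Unit missing": int(item_missing["Price Per Unit"].isna().sum()),
})
print(breakdown)
```

If the two sub-counts add up to the total, as 604 + 609 = 1,213 does in the real data, every missing Item belongs to exactly one of the two groups.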


### 5. **Discount Applied**

In [26]:
missing_discount = data['Discount Applied'].isna()

print(f'Missing Discount Applied rows: {missing_discount.sum()} of {len(data)} ({missing_discount.mean():.2%})')


print(f'\nDiscount Applied value distribution:')

print(data['Discount Applied'].value_counts(dropna=False))

Missing Discount Applied rows: 4199 of 12575 (33.39%)

Discount Applied value distribution:
Discount Applied
True     4219
NaN      4199
False    4157
Name: count, dtype: int64


**Detection**
- About one third of all records (4,199 of 12,575, or 33.39%) are missing Discount Applied
- No relationship was found between missing Discount Applied and missingness in other fields
- The missingness shows no apparent dependence on the values of other fields and no other pattern

**Conclusion**
- Missing Completely at Random (MCAR) on the Discount Applied
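
The cell above does not show how independence from other fields was checked. One possible check, sketched here on randomly generated toy data (an assumption, not the notebook's method), compares missing rates across categories and computes a Pearson chi-square statistic by hand:

```python
import numpy as np
import pandas as pd

# Randomly generated toy data; category names are reused from the dataset.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Category": rng.choice(["Food", "Furniture", "Beverages", "Butchers"], size=n),
    "Discount Applied": rng.choice([True, False], size=n).astype(object),
})
# Blank out roughly one third of the values, independently of Category.
toy.loc[rng.random(n) < 0.33, "Discount Applied"] = np.nan

missing_discount = toy["Discount Applied"].isna()
rate_by_category = missing_discount.groupby(toy["Category"]).mean()
print(rate_by_category)

# Pearson chi-square statistic for Category x missing-flag independence.
table = pd.crosstab(toy["Category"], missing_discount)
observed = table.to_numpy()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"chi2 = {chi2:.2f}, dof = {dof}")
```

Similar missing rates across categories and a small statistic relative to its degrees of freedom are consistent with MCAR, although no test can prove it.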

## The workflow for handling missing values

### 1. Total Spent & Quantity
- Both columns are missing in exactly the same 604 observations (about 4.8% of the data)
- These observations cannot be reconstructed: Total Spent and Quantity are missing together and Item is missing as well, so Price Per Unit alone is not enough to recover them

**Method**: Listwise deletion (drop every affected observation). At 4.8%, the loss is below the common 5% rule of thumb and should have little effect on the dataset as a whole

**Impact**:
- Remaining dataset: 11971 rows
- The remaining 609 missing values in Item can be imputed
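
A minimal sketch of this deletion step, shown on made-up toy rows rather than the real file:

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Item": ["A", None, None, "B"],
    "Quantity": [2.0, np.nan, 3.0, np.nan],
    "Total Spent": [10.0, np.nan, 12.0, np.nan],
})

# Listwise deletion: drop every row where Total Spent or Quantity is missing.
cleaned = toy.dropna(subset=["Total Spent", "Quantity"]).reset_index(drop=True)

print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
print(f"Item still missing after deletion: {cleaned['Item'].isna().sum()}")
```

Applied to the real data, the same call would be expected to leave 12,575 - 604 = 11,971 rows.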


### 2. Price Per Unit
- Can be reconstructed using the formula "Price = Total Spent / Quantity"
- Both Total Spent and Quantity will be available after the first step

**Method**: Using the formula "Price = Total Spent / Quantity"

**Impact**:
- All the Price Per Unit values will be filled
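
The reconstruction can be sketched as follows, again on made-up toy rows. It assumes Quantity is never zero in the rows being filled:

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "Price Per Unit": [5.0, np.nan, np.nan],
    "Quantity": [2.0, 3.0, 4.0],
    "Total Spent": [10.0, 12.0, 30.0],
})

# Derive a price for every row, then use it only where the price is missing.
derived = toy["Total Spent"] / toy["Quantity"]
toy["Price Per Unit"] = toy["Price Per Unit"].fillna(derived)
print(toy)
```

Using `fillna` keeps every observed price untouched and fills only the gaps.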

### 3. Item
- 609 observations are still missing Item after the previous steps
- All of them have Price Per Unit, Quantity and Total Spent
- Every missing Item retains a valid Category; per-category counts remain balanced (example: Milk Products 88, Food 81, Furniture 65)
- Each category has at least one observed item, enabling deterministic filling via category-specific modes (e.g., Milk Products -> Item_16_MILK, Furniture -> Item_25_FUR)

**Method**: Compute the most frequent item per category from the observed rows, then fill each missing Item with the most frequent item of its Category.

**Impact**:
- Category and Item relationship will be preserved
- No impact on other columns
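
A minimal sketch of the per-category mode imputation on made-up toy rows. `Item_16_MILK` and `Item_25_FUR` come from the text above; `Item_1_MILK` is a hypothetical label added for illustration. Ties are broken by taking the first mode in sorted order, which is an assumption:

```python
import pandas as pd

# Toy rows; Item_1_MILK is a hypothetical label used only for illustration.
toy = pd.DataFrame({
    "Category": ["Milk Products"] * 4 + ["Furniture"] * 3,
    "Item": ["Item_16_MILK", "Item_16_MILK", "Item_1_MILK", None,
             "Item_25_FUR", "Item_25_FUR", None],
})

# Most frequent observed item per category (first mode if there is a tie).
observed = toy.dropna(subset=["Item"])
modes = observed.groupby("Category")["Item"].agg(lambda s: s.mode().iloc[0])

# Fill each missing Item with the mode of its own Category.
toy["Item"] = toy["Item"].fillna(toy["Category"].map(modes))
print(toy)
```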

### 4. Discount Applied
- The values are missing completely at random (MCAR) and the missing share is large (more than 30%), so dropping the rows would discard about a third of the data

**Method** (two options):
- Create a new category named "Unknown" and assign it to all missing values
- Alternatively, use random imputation, since the column is binary (True/False)

**Impact**:
- All the missing values are resolved
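
Both options can be sketched on a small made-up series. The random variant below draws replacements from the observed values, so it roughly preserves the observed True/False ratio; the seed and variable names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy column with made-up values, stored as objects like the real column.
discount = pd.Series([True, False, None, True, None, False], dtype="object")
mask = discount.isna()

# Option 1: turn missingness into its own "Unknown" category.
as_category = discount.map({True: "True", False: "False"}).fillna("Unknown")

# Option 2: random imputation, sampling from the observed values.
rng = np.random.default_rng(42)
draws = rng.choice(discount.dropna().to_numpy(), size=int(mask.sum()))
random_imputed = discount.copy()
random_imputed.loc[mask] = draws

print(as_category.value_counts())
print(random_imputed.value_counts())
```

Option 1 keeps the information that a value was originally missing, while option 2 keeps the column binary.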

