<a href="https://colab.research.google.com/github/pratimdas/googlecolab/blob/main/Chapter_2_Recipie_11_(Duplicates_Removal_Chain).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup & Data Loading

**What it does:**

- Sets up the environment with the required libraries
- Loads the dataset from the Colab `sample_data` directory, which must contain `retail_store_sales.csv`
- Prints a basic inspection of the dataset

**Expected Output:**

- Confirmation that the libraries imported
- Dataset shape and memory usage
- List of column names
- Preview of the first 3 rows

**What to look for:**

- Verify that the dataset loads correctly
- Note the column names (we'll use these for duplicate detection)
- Check the data types and overall structure
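
The checklist asks for a look at the data types, which the cell below does not print. A minimal sketch of that check follows. It runs on a small made-up frame (`sample` and its values are assumptions, not rows from the real file) so it works without the CSV; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical three-row stand-in for retail_store_sales.csv
sample = pd.DataFrame({
    "Transaction ID": ["TXN_1", "TXN_2", "TXN_3"],
    "Quantity": [10.0, None, 2.0],
    "Transaction Date": ["2024-04-08", "2023-07-23", "2022-10-05"],
})

# One row per column: its dtype and how many values are missing
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing": sample.isna().sum(),
})
print(summary)

# Dates stay as plain strings unless they are parsed explicitly
sample["Transaction Date"] = pd.to_datetime(sample["Transaction Date"])
print(sample["Transaction Date"].dtype)
```

Seeing the per-column missing counts early is useful here, because the later cleaning step drops any row with a missing value in any column.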

In [None]:
# Recipe 11: Duplicates Removal Chain - Section 1
# Setup & Data Loading

"""
PURPOSE: Set up the environment and load the retail sales dataset for duplicate analysis
EXPECTED OUTPUT: Basic dataset info and shape confirmation
"""

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("📦 Libraries imported successfully!")
print("=" * 50)

# Load the dataset from Colab sample_data
def load_retail_data():
    """
    Load the retail sales dataset from Colab sample_data directory
    Returns: pandas DataFrame or None if loading fails
    """
    filepath = '/content/sample_data/retail_store_sales.csv'

    try:
        df = pd.read_csv(filepath, low_memory=False)
        print("✅ Dataset loaded successfully!")
        print(f"📊 Shape: {df.shape[0]} rows, {df.shape[1]} columns")
        return df
    except FileNotFoundError:
        print(f"❌ File not found at {filepath}")
        print("Please ensure retail_store_sales.csv is in /content/sample_data/")
        return None
    except Exception as e:
        print(f"❌ Error loading dataset: {e}")
        return None

# Load the data
df = load_retail_data()

# Basic dataset inspection
if df is not None:
    print("\n" + "=" * 50)
    print("BASIC DATASET INSPECTION")
    print("=" * 50)

    print(f"Dataset shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

    print("\nColumn names:")
    for i, col in enumerate(df.columns, 1):
        print(f"{i:2d}. {col}")

    print("\nFirst 3 rows:")
    print(df.head(3))

    print("\n✅ Section 1 Complete - Dataset loaded and inspected!")
    print("📋 Ready to move to Section 2: Initial Duplicate Detection")
else:
    print("❌ Cannot proceed - dataset loading failed!")

📦 Libraries imported successfully!
✅ Dataset loaded successfully!
📊 Shape: 12575 rows, 11 columns

BASIC DATASET INSPECTION
Dataset shape: (12575, 11)
Memory usage: 6.29 MB

Column names:
 1. Transaction ID
 2. Customer ID
 3. Category
 4. Item
 5. Price Per Unit
 6. Quantity
 7. Total Spent
 8. Payment Method
 9. Location
10. Transaction Date
11. Discount Applied

First 3 rows:
  Transaction ID Customer ID       Category          Item  Price Per Unit  \
0    TXN_6867343     CUST_09     Patisserie   Item_10_PAT            18.5   
1    TXN_3731986     CUST_22  Milk Products  Item_17_MILK            29.0   
2    TXN_9303719     CUST_02       Butchers   Item_12_BUT            21.5   

   Quantity  Total Spent  Payment Method Location Transaction Date  \
0      10.0        185.0  Digital Wallet   Online       2024-04-08   
1       9.0        261.0  Digital Wallet   Online       2023-07-23   
2       2.0         43.0     Credit Card   Online       2022-10-05   

  Discount Applied  
0      

# Drop Duplicates & Nulls in One Chain
**PURPOSE**: Remove key-based duplicate transactions and drop any rows with missing values

**EXPECTED OUTPUT**: Cleaned DataFrame shape and preview
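
One subtlety of this chain is that the order of the two steps can change the result. With `keep='first'`, if the first copy of a duplicated key has a missing value, that copy is kept by `drop_duplicates` and then discarded by `dropna`, so the complete copy is lost as well. The sketch below illustrates this on two made-up rows (the toy data and variable names are assumptions for illustration).

```python
import pandas as pd

# Two rows with the same key; the first copy is incomplete
toy = pd.DataFrame({
    "Transaction ID": ["T1", "T1"],
    "Item": ["A", "A"],
    "Quantity": [None, 5.0],
})
keys = ["Transaction ID", "Item"]

# Order used in this recipe: dedupe, then drop nulls
dedupe_first = toy.drop_duplicates(subset=keys, keep="first").dropna()

# Alternative order: drop nulls, then dedupe
nulls_first = toy.dropna().drop_duplicates(subset=keys, keep="first")

print(f"dedupe -> dropna keeps {len(dedupe_first)} row(s)")
print(f"dropna -> dedupe keeps {len(nulls_first)} row(s)")
```

Whether this matters depends on the data: if no duplicated key has an incomplete first copy, both orders give the same rows.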

In [None]:
# Section 3: Drop Duplicates & Nulls in One Chain

print("\n" + "=" * 50)
print("SECTION 3: DROP DUPLICATES & NULLS IN ONE CHAIN")
print("=" * 50)

# Define the key columns that uniquely identify a transaction
# (names must match df.columns exactly, as listed in the Section 1 output)
key_cols = ['Transaction ID', 'Customer ID', 'Transaction Date', 'Item']

# Perform the chained clean:
#  1. drop_duplicates on key_cols (keeping first occurrence)
#  2. drop any rows that still have missing values in any column
df_clean = (
    df
    .drop_duplicates(subset=key_cols, keep='first')
    .dropna()
)

# Report row counts before and after cleaning
rows_before = df.shape[0]
rows_after  = df_clean.shape[0]
print(f"Rows before cleaning: {rows_before}")
print(f"Rows after  cleaning: {rows_after}")
print(f"✅ Removed {rows_before - rows_after} rows (duplicates + nulls)")

# Preview the first few rows of the cleaned DataFrame
print("\nFirst 5 rows of df_clean:")
display(df_clean.head(5))

print("\n✅ Section 3 Complete — duplicates & nulls removed!")
print("📋 Recipe 11 Complete — you now have a deduplicated, null-free DataFrame (df_clean)")


SECTION 3: DROP DUPLICATES & NULLS IN ONE CHAIN
Rows before cleaning: 12575
Rows after  cleaning: 7579
✅ Removed 4996 rows (duplicates + nulls)

First 5 rows of df_clean:


| | Transaction ID | Customer ID | Category | Item | Price Per Unit | Quantity | Total Spent | Payment Method | Location | Transaction Date | Discount Applied |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TXN_6867343 | CUST_09 | Patisserie | Item_10_PAT | 18.5 | 10.0 | 185.0 | Digital Wallet | Online | 2024-04-08 | True |
| 1 | TXN_3731986 | CUST_22 | Milk Products | Item_17_MILK | 29.0 | 9.0 | 261.0 | Digital Wallet | Online | 2023-07-23 | True |
| 2 | TXN_9303719 | CUST_02 | Butchers | Item_12_BUT | 21.5 | 2.0 | 43.0 | Credit Card | Online | 2022-10-05 | False |
| 4 | TXN_4575373 | CUST_05 | Food | Item_6_FOOD | 12.5 | 7.0 | 87.5 | Digital Wallet | Online | 2022-10-02 | False |
| 6 | TXN_3652209 | CUST_07 | Food | Item_1_FOOD | 5.0 | 8.0 | 40.0 | Credit Card | In-store | 2023-06-10 | True |



✅ Section 3 Complete — duplicates & nulls removed!
📋 Recipe 11 Complete — you now have a deduplicated, null-free DataFrame (df_clean)
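
The output above reports one combined count (4996 rows removed), which does not show how many of those rows were key-based duplicates and how many had missing values. A sketch of splitting the count by running the chain in two steps follows. It uses made-up toy rows (an assumption for illustration); in the notebook, `df` and `key_cols` would take the place of `toy` and `keys`.

```python
import pandas as pd

# Toy rows: one duplicated key (T1/A) and one row with a missing value (T2/B)
toy = pd.DataFrame({
    "Transaction ID": ["T1", "T1", "T2", "T3"],
    "Item": ["A", "A", "B", "C"],
    "Discount Applied": [True, True, None, False],
})
keys = ["Transaction ID", "Item"]

# Same chain as the recipe, with the intermediate result kept
deduped = toy.drop_duplicates(subset=keys, keep="first")
cleaned = deduped.dropna()

dup_removed = len(toy) - len(deduped)
null_removed = len(deduped) - len(cleaned)
print(f"Removed as duplicates: {dup_removed}")
print(f"Removed for nulls:     {null_removed}")

# Which columns are responsible for the null-related drops
nulls_by_col = deduped.isna().sum()
print(nulls_by_col[nulls_by_col > 0])
```

If one column accounts for most of the missing values, it may be worth passing `subset=` to `dropna` or filling that column instead of discarding the rows.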


# Sanity Checks
**PURPOSE**: Verify that no key‐based duplicates or nulls remain

**EXPECTED OUTPUT**: Zero duplicates and zero missing values

In [None]:
# Section 4: Sanity Checks
# PURPOSE: Verify that no key‐based duplicates or nulls remain
# EXPECTED OUTPUT: Zero duplicates and zero missing values

print("\n" + "=" * 50)
print("SECTION 4: SANITY CHECKS")
print("=" * 50)

# 1. No remaining key‐based duplicates?
remaining_dups = df_clean.duplicated(subset=key_cols).sum()
print(f"🔍 Remaining duplicates on {key_cols}: {remaining_dups}")

# 2. No remaining missing values?
remaining_nulls = df_clean.isna().sum().sum()
print(f"❓ Total missing values in df_clean: {remaining_nulls}")

assert remaining_dups == 0, "There are still duplicate rows present!"
assert remaining_nulls == 0, "There are still missing values present!"

print("\n✅ All sanity checks passed—df_clean is deduplicated and null‐free!")

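# Reusable version of the clean-and-verify steps above
def dedupe_and_validate(df, keys):
    """Drop key-based duplicates, then rows with any missing value, and verify both."""
    df_clean = df.drop_duplicates(subset=keys).dropna()
    assert df_clean.duplicated(subset=keys).sum() == 0, "Duplicate keys remain"
    assert df_clean.isna().sum().sum() == 0, "Missing values remain"
    return df_clean

# Rebuild df_clean through the helper (same result as the Section 3 chain)
df_clean = dedupe_and_validate(df, key_cols)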



SECTION 4: SANITY CHECKS
🔍 Remaining duplicates on ['Transaction ID', 'Customer ID', 'Transaction Date', 'Item']: 0
❓ Total missing values in df_clean: 0

✅ All sanity checks passed—df_clean is deduplicated and null‐free!
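
The checks above rely on `assert`, which Python skips when run with the `-O` flag. For a helper that may be reused outside the notebook, one option is to raise explicit exceptions instead. The sketch below is a hypothetical variant (the name `dedupe_and_check` and the toy data are assumptions, not part of the recipe) that also confirms the key columns exist before cleaning.

```python
import pandas as pd

def dedupe_and_check(df, keys):
    """Hypothetical variant of dedupe_and_validate that raises instead of asserting."""
    # Fail early with a clear message if a key column is misspelled or absent
    missing = [k for k in keys if k not in df.columns]
    if missing:
        raise KeyError(f"Key columns not in DataFrame: {missing}")

    cleaned = df.drop_duplicates(subset=keys, keep="first").dropna()

    # Same two checks as the sanity-check cell, as explicit errors
    if cleaned.duplicated(subset=keys).any():
        raise ValueError("Duplicate keys remain after cleaning")
    if cleaned.isna().any().any():
        raise ValueError("Missing values remain after cleaning")
    return cleaned

# Toy rows: one duplicated key and one row with a missing value
toy = pd.DataFrame({
    "Transaction ID": ["T1", "T1", "T2", "T3"],
    "Item": ["A", "A", "B", "C"],
    "Quantity": [1.0, 1.0, None, 3.0],
})
result = dedupe_and_check(toy, ["Transaction ID", "Item"])
print(result)
```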
