# **Data Cleaning**

## **2. Handling Duplicates**

In [1]:
import numpy as np
import pandas as pd 

## 🔍 Why Handle Duplicates?

Duplicate records can:

* Bias your analysis (e.g., overcount users, inflate sales)
* Distort model training and evaluation, for example when the same record lands in both the training and test sets
* Waste memory and processing time
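
To make the first point concrete, here is a minimal sketch using a made-up `sales` frame (hypothetical data, separate from the dataset used in the rest of this notebook) in which one order was loaded twice. It compares the total before and after removing the repeated row.

```python
import pandas as pd

# Hypothetical data: order 2 was accidentally loaded twice
sales = pd.DataFrame({
    'OrderID': [1, 2, 2, 3],
    'Amount': [100, 250, 250, 50],
})

raw_total = sales['Amount'].sum()                      # counts order 2 twice
clean_total = sales.drop_duplicates()['Amount'].sum()  # counts each order once

print(raw_total, clean_total)
```

The gap between the two totals is exactly the amount of the repeated order.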

In [2]:
data = {
    'CustomerID': [101, 102, 103, 104, 101, 103, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Charlie', 'Eva'],
    'City': ['NY', 'LA', 'SF', 'NY', 'NY', 'SF', 'LA'],
    'Purchase': [200, 150, 300, 400, 200, 300, 500]
}

df = pd.DataFrame(data)

df

Unnamed: 0,CustomerID,Name,City,Purchase
0,101,Alice,NY,200
1,102,Bob,LA,150
2,103,Charlie,SF,300
3,104,David,NY,400
4,101,Alice,NY,200
5,103,Charlie,SF,300
6,105,Eva,LA,500


## 🧰 Key Techniques for Handling Duplicates

### 🔹 **1. Detect Duplicates**

#### ▶️ Method: `duplicated()`

In [3]:
df.duplicated()

0    False
1    False
2    False
3    False
4     True
5     True
6    False
dtype: bool

In [5]:
df.duplicated().sum()

2

In [8]:
(~df.duplicated()).sum()

5

In [10]:
df.duplicated(subset=['CustomerID'])

0    False
1    False
2    False
3    False
4     True
5     True
6    False
dtype: bool

In [11]:
df.duplicated(subset=['City'])

0    False
1    False
2    False
3     True
4     True
5     True
6     True
dtype: bool

In [None]:
df[df.duplicated()]

Unnamed: 0,CustomerID,Name,City,Purchase
4,101,Alice,NY,200
5,103,Charlie,SF,300


In [None]:
df[df.duplicated(subset=['City'])]

Unnamed: 0,CustomerID,Name,City,Purchase
3,104,David,NY,400
4,101,Alice,NY,200
5,103,Charlie,SF,300
6,105,Eva,LA,500

#### ✅ Use Case:

You’re analyzing **sales data** and want to see if any **rows or customer IDs** have been duplicated accidentally (due to reimport or system error).

🔹 *Why this method?*
Helps you **flag duplicates** before deciding whether to drop or correct them.
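
Which rows get flagged depends on the `keep` argument of `duplicated()`. The sketch below rebuilds the `CustomerID` column as a standalone Series (so it runs on its own) and compares the three options side by side. The variable names are illustrative only.

```python
import pandas as pd

ids = pd.Series([101, 102, 103, 104, 101, 103, 105], name='CustomerID')

first = ids.duplicated(keep='first')  # default: flags later repeats only
last = ids.duplicated(keep='last')    # flags earlier repeats only
every = ids.duplicated(keep=False)    # flags every member of a repeated group

print(pd.DataFrame({'CustomerID': ids, 'first': first, 'last': last, 'all': every}))
```

`keep=False` is usually the most convenient for inspection, because it shows all copies of a record together rather than only the extra ones.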


### 🔹 **2. Drop Duplicates**

#### ▶️ Method: `drop_duplicates()`

In [16]:
df.drop_duplicates() # drop completely identical rows

Unnamed: 0,CustomerID,Name,City,Purchase
0,101,Alice,NY,200
1,102,Bob,LA,150
2,103,Charlie,SF,300
3,104,David,NY,400
6,105,Eva,LA,500


In [None]:
df.drop_duplicates(subset=['CustomerID']) # keep first by default

Unnamed: 0,CustomerID,Name,City,Purchase
0,101,Alice,NY,200
1,102,Bob,LA,150
2,103,Charlie,SF,300
3,104,David,NY,400
6,105,Eva,LA,500


In [19]:
df.drop_duplicates(subset=['City'])

Unnamed: 0,CustomerID,Name,City,Purchase
0,101,Alice,NY,200
1,102,Bob,LA,150
2,103,Charlie,SF,300


In [21]:
df.drop_duplicates(subset=['CustomerID'], keep='last')

Unnamed: 0,CustomerID,Name,City,Purchase
1,102,Bob,LA,150
3,104,David,NY,400
4,101,Alice,NY,200
5,103,Charlie,SF,300
6,105,Eva,LA,500


In [None]:
df.drop_duplicates(subset=['CustomerID'], keep=False) # drop all duplicates

Unnamed: 0,CustomerID,Name,City,Purchase
1,102,Bob,LA,150
3,104,David,NY,400
6,105,Eva,LA,500


#### ✅ Real-World Use Cases:

* In a **user database**, keep the **latest entry** for each user ID (`keep='last'`, assuming rows are ordered oldest to newest).
* In **survey data**, drop the extra copies when a respondent submitted multiple identical responses.

🔹 *Why this method?*
Gives flexibility in choosing which duplicate to keep: the first (`keep='first'`, the default), the last (`keep='last'`), or none (`keep=False`).
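
`keep='last'` only returns the latest entry if the rows are already in chronological order. A minimal sketch, assuming a hypothetical `users` table with an `UpdatedAt` timestamp column (neither is part of this notebook's data), sorts first and then deduplicates:

```python
import pandas as pd

# Hypothetical user table; `UpdatedAt` is an assumed column name.
# Note that user 2's newest row appears before the older one.
users = pd.DataFrame({
    'UserID': [1, 2, 1, 3, 2],
    'Email': ['a.old@example.com', 'b.new@example.com', 'a.new@example.com',
              'c@example.com', 'b.old@example.com'],
    'UpdatedAt': pd.to_datetime([
        '2024-01-05', '2024-03-01', '2024-02-10', '2024-01-20', '2024-01-15',
    ]),
})

latest = (
    users.sort_values('UpdatedAt')                        # oldest first
         .drop_duplicates(subset='UserID', keep='last')   # newest row per user
         .sort_values('UserID')
         .reset_index(drop=True)
)
print(latest)
```

Without the initial sort, `keep='last'` would have kept the older row for user 2, since it comes later in the table.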


### 🔹 **3. Keep Track of Duplicates Before Dropping**

In [23]:
df['is_duplicated'] = df.duplicated(subset=['CustomerID'], keep=False)

In [24]:
df

Unnamed: 0,CustomerID,Name,City,Purchase,is_duplicated
0,101,Alice,NY,200,True
1,102,Bob,LA,150,False
2,103,Charlie,SF,300,True
3,104,David,NY,400,False
4,101,Alice,NY,200,True
5,103,Charlie,SF,300,True
6,105,Eva,LA,500,False


In [25]:
df.drop_duplicates(subset=['CustomerID'])

Unnamed: 0,CustomerID,Name,City,Purchase,is_duplicated
0,101,Alice,NY,200,True
1,102,Bob,LA,150,False
2,103,Charlie,SF,300,True
3,104,David,NY,400,False
6,105,Eva,LA,500,False


#### ✅ Use Case:

In fraud detection, you might want to **flag users** with duplicate transaction IDs rather than dropping them.

🔹 *Why this method?*
Use it when you want to **preserve information about duplicates** instead of removing the rows.
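
For the fraud-style use case, a boolean flag can be paired with a per-row occurrence number. The sketch below assumes a hypothetical `tx` frame with a `TransactionID` column and uses `groupby().cumcount()` to number the repeats. It is one possible way to do this, not the only one.

```python
import pandas as pd

# Hypothetical transactions; T1 appears three times
tx = pd.DataFrame({
    'TransactionID': ['T1', 'T2', 'T1', 'T3', 'T1'],
    'Amount': [50, 20, 50, 75, 50],
})

# True for every row whose TransactionID appears more than once
tx['is_duplicated'] = tx.duplicated(subset=['TransactionID'], keep=False)

# 0 for the first occurrence, 1 for the second, and so on
tx['occurrence'] = tx.groupby('TransactionID').cumcount()

print(tx)
```

No rows are removed, so an auditor can still filter on `is_duplicated` or on `occurrence > 0` later.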


### 🔹 **4. Count Frequency of Duplicate Values**

In [26]:
df.value_counts()

CustomerID  Name     City  Purchase  is_duplicated
101         Alice    NY    200       True             2
103         Charlie  SF    300       True             2
102         Bob      LA    150       False            1
104         David    NY    400       False            1
105         Eva      LA    500       False            1
Name: count, dtype: int64

In [27]:
df['CustomerID'].value_counts()

CustomerID
101    2
103    2
102    1
104    1
105    1
Name: count, dtype: int64

In [29]:
df[df['CustomerID'].duplicated(keep=False)]

Unnamed: 0,CustomerID,Name,City,Purchase,is_duplicated
0,101,Alice,NY,200,True
2,103,Charlie,SF,300,True
4,101,Alice,NY,200,True
5,103,Charlie,SF,300,True


#### ✅ Use Case:

In a **marketing campaign**, you want to find users who registered more than once (for example, the same customer ID appearing under different email IDs).

🔹 *Why this method?*
Shows **how many times** each value occurs, which is useful for reporting or threshold-based filtering.
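
Threshold-based filtering can be sketched as follows. The frame `df_ids` and the threshold of 3 are illustrative assumptions. The second half shows an alternative that attaches the group size to every row with `transform('size')`.

```python
import pandas as pd

# Illustrative IDs; 101 appears three times, 103 twice
df_ids = pd.DataFrame({
    'CustomerID': [101, 102, 103, 104, 101, 103, 105, 101],
})

counts = df_ids['CustomerID'].value_counts()
repeat_ids = counts[counts >= 3].index   # the threshold of 3 is arbitrary

# Alternative: store each row's group size as a column
df_ids['n_rows'] = df_ids.groupby('CustomerID')['CustomerID'].transform('size')

print(df_ids[df_ids['CustomerID'].isin(repeat_ids)])
```

The `n_rows` column is handy when the count itself should travel with the data, for instance into a report.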


### 🔹 **5. Conditional Deduplication (Advanced)**

Example: Keep the row with the **maximum purchase** for each duplicate `CustomerID`.

In [31]:
df

Unnamed: 0,CustomerID,Name,City,Purchase,is_duplicated
0,101,Alice,NY,200,True
1,102,Bob,LA,150,False
2,103,Charlie,SF,300,True
3,104,David,NY,400,False
4,101,Alice,NY,200,True
5,103,Charlie,SF,300,True
6,105,Eva,LA,500,False


In [30]:
df.sort_values('Purchase', ascending=False).drop_duplicates('CustomerID', keep='first')

Unnamed: 0,CustomerID,Name,City,Purchase,is_duplicated
6,105,Eva,LA,500,False
3,104,David,NY,400,False
2,103,Charlie,SF,300,True
0,101,Alice,NY,200,True
1,102,Bob,LA,150,False


#### ✅ Use Case:

In **e-commerce**, retain the **highest purchase value** per customer in a reporting dataset.

🔹 *Why this method?*
Deduplication driven by business logic: a **customized retention strategy**.
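
In the sample data the repeated customers have identical purchase amounts, so the effect of sorting is hard to see. The sketch below uses a hypothetical `orders` frame with differing amounts and an alternative formulation based on `groupby().idxmax()`. It assumes a unique index and no missing `Purchase` values.

```python
import pandas as pd

# Hypothetical orders with different amounts per customer
orders = pd.DataFrame({
    'CustomerID': [101, 102, 101, 103, 103],
    'Purchase': [200, 150, 350, 300, 120],
})

# Index label of the largest Purchase within each CustomerID
best_rows = orders.groupby('CustomerID')['Purchase'].idxmax()

# Select those rows; the original row contents are kept intact
top = orders.loc[best_rows].reset_index(drop=True)

print(top)
```

Unlike the sort-based approach, this keeps the result ordered by `CustomerID`. When amounts tie, `idxmax()` picks the first occurrence.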


### 🔹 **6. Group-wise Deduplication**

Use `groupby()` to deduplicate based on business logic within groups.


In [36]:
df

Unnamed: 0,CustomerID,Name,City,Purchase,is_duplicated
0,101,Alice,NY,200,True
1,102,Bob,LA,150,False
2,103,Charlie,SF,300,True
3,104,David,NY,400,False
4,101,Alice,NY,200,True
5,103,Charlie,SF,300,True
6,105,Eva,LA,500,False


In [35]:
df.groupby('CustomerID').first().reset_index()

Unnamed: 0,CustomerID,Name,City,Purchase,is_duplicated
0,101,Alice,NY,200,True
1,102,Bob,LA,150,False
2,103,Charlie,SF,300,True
3,104,David,NY,400,False
4,105,Eva,LA,500,False


#### ✅ Use Case:

In event logs, for each `UserID`, keep the **first event** only.

🔹 *Why this method?*
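Structured deduplication when your **unique entity is spread across multiple rows**. Note that `groupby().first()` returns the first non-null value of each column, so one output row can combine values from several input rows.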
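
The sketch below contrasts `groupby().first()` with `groupby().head(1)` on a hypothetical `events` log that contains a missing value. The column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical event log; the first event of user 1 has no Page
events = pd.DataFrame({
    'UserID': [1, 1, 2],
    'Event': ['login', 'click', 'login'],
    'Page': [np.nan, 'home', 'home'],
})

# first(): first non-null value per column within each group
by_first = events.groupby('UserID').first().reset_index()

# head(1): the first actual row of each group, missing values included
by_head = events.groupby('UserID').head(1).reset_index(drop=True)

print(by_first)
print(by_head)
```

For user 1, `first()` pairs the `login` event with the `Page` of the later `click` row, while `head(1)` returns the original first row unchanged. If the goal is "keep the first event as recorded", `head(1)` or `drop_duplicates(subset=['UserID'])` is likely the safer choice.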

## 📌 Summary Table

| Technique                  | When to Use                                                  | Example                                              |
| -------------------------- | ------------------------------------------------------------ | ---------------------------------------------------- |
| `duplicated()`             | To detect duplicates                                         | Flag duplicate customer IDs                          |
| `drop_duplicates()`        | Remove duplicate rows based on rules                         | Keep first or last duplicate rows                    |
| `drop_duplicates(keep=False)` | Drop all occurrences of duplicates                        | Clean dataset of repeat survey entries               |
| `value_counts()`           | Analyze frequency of duplication                             | See how many times a product ID occurs               |
| `groupby().first()`        | Keep one entry per group based on custom logic               | First login per user                                 |
| `sort_values().drop_duplicates()` | Retain record with highest or lowest metric (e.g., purchase) | Keep highest transaction per customer         |
| Flag column via `duplicated(keep=False)` | Retain info about duplicates instead of dropping | Add `is_duplicated` flag for fraud or audit use cases |


### 🧠 Best Practices

* Always inspect data using `duplicated()` before dropping.
* If working on transactional or customer data, define **what makes a record unique** (e.g., ID + Date).
* Consider **domain knowledge**: some "duplicates" may be valid repeat entries.
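
The choice of key can be illustrated with a small, hypothetical `purchases` frame that adds a `Date` column (an assumption, since the notebook's dataset has no date). It compares deduplicating on `CustomerID` alone with deduplicating on `CustomerID` plus `Date`.

```python
import pandas as pd

# Hypothetical data: one true double entry and one valid repeat purchase
purchases = pd.DataFrame({
    'CustomerID': [101, 101, 101, 102],
    'Date': ['2024-05-01', '2024-05-01', '2024-05-02', '2024-05-01'],
    'Purchase': [200, 200, 200, 150],
})

# Keyed on CustomerID alone, the valid repeat purchase on 2024-05-02 is lost
by_id = purchases.drop_duplicates(subset=['CustomerID'])

# Keyed on CustomerID + Date, only the true double entry is removed
by_id_date = purchases.drop_duplicates(subset=['CustomerID', 'Date'])

print(len(by_id), len(by_id_date))
```

Here the composite key keeps three rows instead of two, preserving the legitimate second purchase.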

