# Handling Duplicate Values
Duplicates are records that represent the same real-world entity,
even if they are not always identical row by row.

## Dataset summary
- **Exact Duplicates**: Completely identical rows (e.g., `Amit` and `John` each appear twice with exactly the same data).
- **Partial Duplicates**: Rows where the person is the same (`name`), but other info has changed (e.g., `Siddharth` has a new salary/city).
- **Case Sensitivity Issues**: `Neha` appears as both "Neha" and "neha". Standard duplicate checks will miss this.
- **Whitespace Issues**: One instance of `Pune` has a leading space (`" Pune"`). This is a common real-world data cleaning challenge.
- **Subset Conflict**: `Amit` appears with the same name/gender but a different age/joining date.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("08_duplicates_data.csv")
df

Unnamed: 0,name,gender,age,salary,city,date of joining
0,Vijay,Male,40,93707,Chennai,06-09-2022
1,Liam,Male,45,97001,Kolkata,16-09-2018
2,Amit,Male,28,40769,Kolkata,30-10-2021
3,Riya,Female,41,99735,Pune,09-02-2018
4,Sonia,Female,32,46396,Hyderabad,13-03-2019
5,Priya,Female,44,59769,Delhi,03-09-2022
6,Sara,Female,42,81434,Kolkata,24-09-2018
7,Riya,Female,41,99735,Pune,09-02-2018
8,neha,Female,32,42433,Kolkata,04-03-2020
9,Zoe,Female,42,77819,Bangalore,28-11-2023


## Checking Duplicate Values
- `.duplicated()` returns a boolean Series that marks every row repeating an earlier row as `True`.

### Count total exact duplicates

In [3]:
df.duplicated().sum()

np.int64(2)
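
The count above depends on the `keep` argument of `.duplicated()`. The sketch below uses a small hypothetical DataFrame (`toy`, not the tutorial's CSV) to show which rows each setting flags.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration (not the tutorial's CSV).
toy = pd.DataFrame({
    "name": ["Amit", "Riya", "Amit", "Amit"],
    "city": ["Kolkata", "Pune", "Kolkata", "Kolkata"],
})

# keep="first" (the default): every repeat after the first occurrence is flagged.
first_mask = toy.duplicated(keep="first")
# keep="last": every repeat before the last occurrence is flagged.
last_mask = toy.duplicated(keep="last")
# keep=False: every member of a duplicate group is flagged.
all_mask = toy.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Summing the default mask counts the surplus rows (2 here), while `keep=False` is the variant to use when you want to see every copy side by side.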

### View the actual duplicate rows

In [4]:
df[df.duplicated(keep=False)].sort_values(by='name')

Unnamed: 0,name,gender,age,salary,city,date of joining
2,Amit,Male,28,40769,Kolkata,30-10-2021
20,Amit,Male,28,40769,Kolkata,30-10-2021
10,John,Male,36,96101,Pune,02-06-2019
15,John,Male,36,96101,Pune,02-06-2019


## Removing Duplicates

### Exact Duplicates
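Keep the first occurrence of each duplicated row and remove the rest. The call returns a new DataFrame; `df` itself is unchanged unless the result is assigned back.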

In [5]:
df.drop_duplicates(keep='first')

Unnamed: 0,name,gender,age,salary,city,date of joining
0,Vijay,Male,40,93707,Chennai,06-09-2022
1,Liam,Male,45,97001,Kolkata,16-09-2018
2,Amit,Male,28,40769,Kolkata,30-10-2021
3,Riya,Female,41,99735,Pune,09-02-2018
4,Sonia,Female,32,46396,Hyderabad,13-03-2019
5,Priya,Female,44,59769,Delhi,03-09-2022
6,Sara,Female,42,81434,Kolkata,24-09-2018
7,Riya,Female,41,99735,Pune,09-02-2018
8,neha,Female,32,42433,Kolkata,04-03-2020
9,Zoe,Female,42,77819,Bangalore,28-11-2023


### Handling "Hidden" Duplicates (Data Normalization)
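Normalize the text columns first: convert names to lowercase and strip whitespace from the city values. Rows that differed only by case or stray spaces then become identical, so `drop_duplicates()` can catch them.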

In [6]:
df['name'] = df['name'].str.lower()
df['city'] = df['city'].str.strip()
df.drop_duplicates(inplace=True)
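
To see why normalization matters, here is a minimal sketch on a hypothetical DataFrame (`toy`) that mimics the case and whitespace issues from the dataset summary. It works on a copy, which is an assumption about workflow rather than something the cell above does, so the raw values stay available for auditing.

```python
import pandas as pd

# Hypothetical toy data that mimics the case and whitespace issues described earlier.
toy = pd.DataFrame({
    "name": ["Neha", "neha", "Riya", "Riya"],
    "city": ["Kolkata", "Kolkata", "Pune", " Pune"],
})

# Before normalization no two rows are byte-for-byte identical.
exact_before = int(toy.duplicated().sum())

# Normalize on a copy so the original values are preserved.
clean = toy.copy()
clean["name"] = clean["name"].str.strip().str.lower()
clean["city"] = clean["city"].str.strip()

# After normalization the hidden duplicates become exact duplicates.
exact_after = int(clean.duplicated().sum())
deduped = clean.drop_duplicates().reset_index(drop=True)

print(exact_before, exact_after)  # 0 2
print(deduped)
```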

## Handling Partial Duplicates (Subsets)
The subset parameter helps identify duplicates based on
business logic rather than the entire row.

- If the rule is "one entry per name", pass `subset=['name']`.
- `keep='last'` keeps the last occurrence in row order, which is the most recent record only if the rows are already sorted chronologically.

In [7]:
df.drop_duplicates(subset=['name'], keep='last', inplace=True)
df

Unnamed: 0,name,gender,age,salary,city,date of joining
0,vijay,Male,40,93707,Chennai,06-09-2022
1,liam,Male,45,97001,Kolkata,16-09-2018
3,riya,Female,41,99735,Pune,09-02-2018
4,sonia,Female,32,46396,Hyderabad,13-03-2019
5,priya,Female,44,59769,Delhi,03-09-2022
6,sara,Female,42,81434,Kolkata,24-09-2018
8,neha,Female,32,42433,Kolkata,04-03-2020
9,zoe,Female,42,77819,Bangalore,28-11-2023
10,john,Male,36,96101,Pune,02-06-2019
11,mike,Male,45,67480,Bangalore,09-02-2023
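
Because `keep='last'` follows row order, one way to make "last" mean "most recent" is to sort by a parsed date first. The sketch below assumes day-month-year strings like those in `date of joining`; the rows, the salary values, and the helper column `joined` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the values are made up for illustration.
toy = pd.DataFrame({
    "name": ["siddharth", "amit", "siddharth"],
    "salary": [70000, 40000, 60000],
    "date of joining": ["15-03-2023", "30-10-2021", "01-06-2020"],
})

# Parse day-month-year strings so sorting is chronological, not alphabetical.
toy["joined"] = pd.to_datetime(toy["date of joining"], format="%d-%m-%Y")

# Naive approach: keeps whichever row happens to come last in the file.
naive = toy.drop_duplicates(subset=["name"], keep="last")

# Sort by date first, so the last row per name is the most recent one.
latest = (
    toy.sort_values("joined")
       .drop_duplicates(subset=["name"], keep="last")
       .sort_index()
)

print(naive[["name", "salary"]])
print(latest[["name", "salary"]])
```

In this toy data the naive call keeps the older `siddharth` record (salary 60000), while the sorted version keeps the 2023 record (salary 70000).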


**Note**: Dropping duplicates without understanding the context can lead to loss of important information.
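
One way to act on this note is to list the names whose rows disagree before dropping anything. The following is a sketch on hypothetical data; the variable names `distinct` and `conflicting` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: "amit" has conflicting rows, "riya" is a true repeat.
toy = pd.DataFrame({
    "name": ["amit", "amit", "riya", "riya", "zoe"],
    "age": [28, 35, 41, 41, 42],
    "city": ["Kolkata", "Delhi", "Pune", "Pune", "Bangalore"],
})

# Count the distinct values in each column within every name group.
distinct = toy.groupby("name").nunique()

# A name is conflicting if any column holds more than one distinct value.
conflicting = distinct[(distinct > 1).any(axis=1)].index.tolist()

print(conflicting)  # ['amit']
# Review these rows manually before deciding which one to keep.
print(toy[toy["name"].isin(conflicting)])
```

Names that do not show up in `conflicting` can be deduplicated safely on `name`; the rest need a business rule, such as keeping the most recent record.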

## Summary
- Always inspect duplicates before removing them
- Use `subset` to define meaningful duplicates
- Data consistency is as important as duplicate removal