# Data Deduplication

## Scenario:
A CSV contains user data but has duplicates based on email.

## Task:

- Read the file.
- Drop duplicates, keeping the most recent entry for each email based on the 'last_updated' column.
- Output the cleaned file.

### Import pandas for data manipulation

In [1]:
import pandas as pd

## 🔹 LOAD :


### Step 1: Read the user data from a CSV file


This file may contain multiple rows with the same email


In [2]:
df = pd.read_csv('users.csv')  

In [3]:
df

Unnamed: 0,user_id,name,email,last_updated
0,U001,Alice,alice@example.com,01-03-2024 10:00
1,U002,Bob,bob@example.com,02-03-2024 12:00
2,U003,Alice,alice@example.com,03-03-2024 14:00
3,U004,Charlie,charlie@example.com,04-03-2024 16:00
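To experiment without the file on disk, the rows shown above can be loaded from an in-memory string. This is only a sketch: the column layout is inferred from the displayed output, the leading `Unnamed: 0` column is assumed to be the DataFrame index rather than a real column, and `SAMPLE_CSV`, `sample_df` and `email_counts` are illustrative names.

```python
import io

import pandas as pd

# Sample rows copied from the output shown above; the real users.csv may differ.
SAMPLE_CSV = """user_id,name,email,last_updated
U001,Alice,alice@example.com,01-03-2024 10:00
U002,Bob,bob@example.com,02-03-2024 12:00
U003,Alice,alice@example.com,03-03-2024 14:00
U004,Charlie,charlie@example.com,04-03-2024 16:00
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Count how often each email appears; any count above 1 marks a duplicate.
email_counts = sample_df["email"].value_counts()
print(email_counts[email_counts > 1])
```

Counting emails before deduplicating gives a quick sense of how many rows the later steps should remove.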


### Step 2: Convert the 'last_updated' column into datetime format

Parsing the strings as datetimes lets us compare entries chronologically rather than as text


In [4]:
df['last_updated'] = pd.to_datetime(df['last_updated'])
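Strings such as `02-03-2024 12:00` are ambiguous: they can be read month-first or day-first. The saved output later in this notebook suggests they were read month-first, but that is an inference, not something the source data confirms. A minimal sketch of making the choice explicit with the `format` argument (the `raw` series is made-up sample input):

```python
import pandas as pd

raw = pd.Series(["01-03-2024 10:00", "02-03-2024 12:00"])

# Assumption: month-day-year, matching the output shown later in the notebook.
parsed = pd.to_datetime(raw, format="%m-%d-%Y %H:%M")

# If the data is actually day-month-year, swap the first two directives.
parsed_dayfirst = pd.to_datetime(raw, format="%d-%m-%Y %H:%M")

print(parsed.dt.month.tolist())           # [1, 2]
print(parsed_dayfirst.dt.month.tolist())  # [3, 3]
```

With an explicit format, a value that does not match raises an error instead of being silently parsed under a different convention.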

## 🔹 DEDUPLICATE :


### Step 3: Sort all entries by 'last_updated' so that the latest record comes last


In [5]:
df_sorted = df.sort_values('last_updated')

### Step 4: Remove duplicate emails, keeping only the most recent entry for each email


In [6]:
df_cleaned = df_sorted.drop_duplicates(subset='email', keep='last')
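Steps 3 and 4 can be wrapped in one small helper so the same logic is reusable for other key and timestamp columns. This is a sketch, not part of the original notebook: `keep_latest` and the `demo` frame are hypothetical names, and the stable sort is an added choice so that rows with identical timestamps keep their original relative order.

```python
import pandas as pd


def keep_latest(df: pd.DataFrame, key: str, timestamp: str) -> pd.DataFrame:
    """Return one row per `key`, keeping the row with the greatest `timestamp`."""
    # mergesort is stable: ties on `timestamp` keep their original order.
    ordered = df.sort_values(timestamp, kind="mergesort")
    # After an ascending sort, the last row per key is the most recent one.
    return ordered.drop_duplicates(subset=key, keep="last")


demo = pd.DataFrame(
    {
        "user_id": ["U001", "U002", "U003"],
        "email": ["alice@example.com", "bob@example.com", "alice@example.com"],
        "last_updated": pd.to_datetime(
            ["2024-01-03 10:00", "2024-02-03 12:00", "2024-03-03 14:00"]
        ),
    }
)

latest = keep_latest(demo, key="email", timestamp="last_updated")
print(latest["user_id"].tolist())  # ['U002', 'U003']
```

In the notebook, `keep_latest(df, key="email", timestamp="last_updated")` would stand in for the two separate cells.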

## 🔹 SAVE :


### Step 5: Save the cleaned, deduplicated user data to a new file


In [7]:
df_cleaned.to_csv('deduplicated_users.csv', index=False)

### Confirmation message

In [8]:
print("Deduplication complete. Cleaned file saved as 'deduplicated_users.csv'")

Deduplication complete. Cleaned file saved as 'deduplicated_users.csv'


### Step 6 (optional): Load the cleaned CSV file that we just saved

Reloading the saved file lets us verify that the pipeline produced the expected output


In [10]:
df_result = pd.read_csv('deduplicated_users.csv')
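Note that a plain `read_csv` returns the timestamp column as text again, because CSV files carry no type information. If datetimes are needed after reloading, the `parse_dates` argument can restore them. A short sketch using an in-memory copy of two saved rows (`saved_text`, `as_text` and `as_dates` are illustrative names):

```python
import io

import pandas as pd

saved_text = (
    "user_id,email,last_updated\n"
    "U002,bob@example.com,2024-02-03 12:00:00\n"
    "U003,alice@example.com,2024-03-03 14:00:00\n"
)

# Without parse_dates the timestamp column comes back as plain strings.
as_text = pd.read_csv(io.StringIO(saved_text))

# With parse_dates the column is converted to datetimes while loading.
as_dates = pd.read_csv(io.StringIO(saved_text), parse_dates=["last_updated"])

print(as_text["last_updated"].dtype)
print(as_dates["last_updated"].dtype)
```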

### Display the cleaned data

This gives a quick preview of the output without opening the file separately


In [11]:
df_result

Unnamed: 0,user_id,name,email,last_updated
0,U002,Bob,bob@example.com,2024-02-03 12:00:00
1,U003,Alice,alice@example.com,2024-03-03 14:00:00
2,U004,Charlie,charlie@example.com,2024-04-03 16:00:00
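Beyond inspecting the output by eye, the result can be checked programmatically. The following is a minimal sketch, assuming `email` is the deduplication key; `check_deduplicated` is a hypothetical helper, and the rows are copied from the output above so the block runs on its own.

```python
import io

import pandas as pd


def check_deduplicated(df: pd.DataFrame, key: str = "email") -> None:
    """Raise AssertionError if `key` has missing or repeated values."""
    assert df[key].notna().all(), f"missing values in {key!r}"
    assert df[key].is_unique, f"duplicate values remain in {key!r}"


# Rows copied from the output above; in the notebook, pass df_result instead.
saved = pd.read_csv(
    io.StringIO(
        "user_id,name,email,last_updated\n"
        "U002,Bob,bob@example.com,2024-02-03 12:00:00\n"
        "U003,Alice,alice@example.com,2024-03-03 14:00:00\n"
        "U004,Charlie,charlie@example.com,2024-04-03 16:00:00\n"
    )
)

check_deduplicated(saved)
print(f"{len(saved)} unique emails")  # 3 unique emails
```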
