# **Data Transformation**

## **7. Handling duplicates or unique transformations**

In [1]:
import numpy as np 
import pandas as pd 

## ✅ 1. What it Does and When to Use It

Handling duplicates or extracting unique values is essential during data preprocessing to:

* **Eliminate redundancy**
* **Ensure data quality**
* **Reduce noise**
* **Prevent bias or skew** in analysis or modeling

You'll typically do this **after data ingestion or merging** and **before modeling or analysis**.


## 🧠 2. Syntax and Core Parameters

### ➤ Main Functions

| Function                   | Description                                  |
| -------------------------- | -------------------------------------------- |
| `df.duplicated()`          | Returns a Boolean Series marking rows that repeat an earlier row |
| `df.drop_duplicates()`     | Returns the DataFrame with duplicate rows removed                |
| `df.nunique()`             | Counts distinct values per column (NaN excluded by default)      |
| `df['col'].unique()`       | Returns the distinct values of a Series as an array              |
| `df['col'].value_counts()` | Counts how often each distinct value occurs                      |

### ➤ Core Parameters for `drop_duplicates()`:

```python
df.drop_duplicates(
    subset=None,        # Column(s) to compare; None means all columns
    keep='first',       # Which occurrence to keep: 'first', 'last', or False (drop every copy)
    inplace=False,      # True modifies the DataFrame in place and returns None
    ignore_index=False  # True relabels the result's index 0..n-1
)
```
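The effect of `keep` and `ignore_index` is easiest to see side by side. The following is a minimal sketch on made-up data (the `email` and `visit` columns are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: 'b@mail.com' appears three times
df = pd.DataFrame({
    'email': ['a@mail.com', 'b@mail.com', 'b@mail.com', 'c@mail.com', 'b@mail.com'],
    'visit': [1, 2, 3, 4, 5],
})

first = df.drop_duplicates(subset='email', keep='first')  # keeps visits 1, 2, 4
last = df.drop_duplicates(subset='email', keep='last')    # keeps visits 1, 4, 5
none = df.drop_duplicates(subset='email', keep=False)     # keeps visits 1, 4

# ignore_index=True relabels the result 0..n-1 instead of keeping original labels
relabeled = df.drop_duplicates(subset='email', keep='last', ignore_index=True)

print(last.index.tolist())       # original labels survive: [0, 3, 4]
print(relabeled.index.tolist())  # fresh labels: [0, 1, 2]
```

Note that `keep=False` removes every member of a duplicate group, which is useful when any repeated key should be treated as suspect.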


## 🧪 3. Methods and Techniques

### A. **Detect Duplicates**

```python
df.duplicated()                  # True if row is duplicate of a previous row
df.duplicated(subset=['name'])  # Check duplicates in 'name' column only
```
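`duplicated()` also accepts `keep`. As a small sketch on made-up data, `keep=False` flags every row in a duplicate group rather than only the later copies, which helps when inspecting duplicates:

```python
import pandas as pd

# Hypothetical data: the 'Bob' row appears twice
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Bob'],
    'city': ['Paris', 'Rome', 'Oslo', 'Rome'],
})

later_copies = df.duplicated()          # default keep='first': flags only the second 'Bob'
all_copies = df.duplicated(keep=False)  # flags both 'Bob' rows

print(later_copies.tolist())  # [False, False, False, True]
print(all_copies.tolist())    # [False, True, False, True]
print(later_copies.sum())     # rows that drop_duplicates() would remove: 1
```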

---

### B. **Drop Duplicates**

```python
df.drop_duplicates()                               # Drop repeated rows, keeping the first occurrence of each
df.drop_duplicates(subset=['email'], keep='last')  # One row per email, keeping the last occurrence
```

---

### C. **Get Unique Values**

```python
df['gender'].unique()
df['gender'].nunique()          # Count unique values
```
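The two calls treat missing values differently, which can make their results disagree. A brief sketch, using a made-up `gender` Series with gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries
gender = pd.Series(['F', 'M', np.nan, 'F', np.nan])

values = gender.unique()                   # includes NaN, in order of first appearance
n_without_nan = gender.nunique()           # NaN is excluded by default
n_with_nan = gender.nunique(dropna=False)  # NaN counted as one distinct value

print(len(values))    # 3
print(n_without_nan)  # 2
print(n_with_nan)     # 3
```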

---

### D. **Count Value Frequencies**

```python
df['city'].value_counts()
```
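`value_counts()` has a few options worth knowing. The sketch below, on a made-up `city` Series, shows proportions via `normalize=True`, missing values via `dropna=False`, and one way to list values that occur more than once:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing entry
city = pd.Series(['Paris', 'Rome', 'Paris', np.nan])

counts = city.value_counts()                # NaN excluded by default
shares = city.value_counts(normalize=True)  # proportions of the non-missing values
with_nan = city.value_counts(dropna=False)  # NaN gets its own row

repeated = counts[counts > 1].index.tolist()  # values that occur more than once

print(counts['Paris'])            # 2
print(round(shares['Paris'], 3))  # 0.667
print(with_nan.sum())             # 4
print(repeated)                   # ['Paris']
```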

---

### E. **Group and Remove Duplicates by Aggregation**

```python
# One row per email; 'first' takes the first non-null value in each column
df.groupby('email').agg('first').reset_index()
```
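This is not always equivalent to `drop_duplicates(subset='email')`. As a sketch on made-up data (the `phone` column is an assumption for illustration), the grouped `'first'` skips missing values, so it can combine values from different rows, while `drop_duplicates` keeps one original row intact:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first 'x@mail.com' row has no phone number
df = pd.DataFrame({
    'email': ['x@mail.com', 'x@mail.com', 'y@mail.com'],
    'phone': [np.nan, '555-0100', '555-0199'],
})

by_row = df.drop_duplicates(subset='email', keep='first', ignore_index=True)
by_group = df.groupby('email').agg('first').reset_index()

print(by_row.loc[0, 'phone'])    # nan (the first row is kept as it is)
print(by_group.loc[0, 'phone'])  # 555-0100 (first non-null value in the group)
```

Which behavior is preferable depends on whether each row should be treated as an indivisible record or as a partial source of fields.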


## ⚠️ 4. Common Pitfalls and Best Practices

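| Pitfall                                   | Explanation                                                                                                              |
| ----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| Forgetting `subset=`                      | Rows count as duplicates only when every column matches, so repeats on a key column can survive                          |
| Not resetting index after dropping        | The result keeps the original labels and has gaps; pass `ignore_index=True` or call `reset_index(drop=True)`             |
| Using `inplace=True` without need         | The original DataFrame is overwritten, so dropped rows cannot be recovered                                               |
| Assuming `duplicated()` flags every copy  | With the default `keep='first'`, the first occurrence is `False` and only later ones are `True`; `keep=False` flags all  |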
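The `subset=` pitfall can be illustrated with a short sketch. The data below is made up: two rows share an `id` but differ in `email`, so a full-row comparison does not treat them as duplicates.

```python
import pandas as pd

# Hypothetical data: id 2 appears twice with different emails
df = pd.DataFrame({
    'id': [1, 2, 2],
    'name': ['Alice', 'Bob', 'Bob'],
    'email': ['a@mail.com', 'b@mail.com', 'bob@mail.com'],
})

full_match = df.drop_duplicates()        # nothing removed: no two rows are fully identical
by_id = df.drop_duplicates(subset='id')  # the second id=2 row is removed

print(len(full_match))  # 3
print(len(by_id))       # 2
```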

### ✅ Best Practices:

* Always **verify before dropping** with `df[df.duplicated()]`
* When unsure, **don't use `inplace=True`** to avoid accidental data loss
* Use **subset** parameter to focus on specific columns
* Use **`.copy()`** when extracting unique subsets for further processing
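One way to put these practices together is sketched below. The data and the variable names (`suspects`, `deduped`) are assumptions for illustration, not a prescribed workflow:

```python
import pandas as pd

# Hypothetical data: id 2 appears twice
df = pd.DataFrame({
    'id': [1, 2, 3, 2],
    'name': ['Alice', 'Bob', 'Charlie', 'Bob'],
})

# 1. Inspect every row involved in a duplicate group before removing anything;
#    .copy() keeps later edits to `suspects` from touching df
suspects = df[df.duplicated(subset='id', keep=False)].copy()
print(suspects)

# 2. Drop into a new DataFrame instead of using inplace=True
n_to_drop = df.duplicated(subset='id').sum()
deduped = df.drop_duplicates(subset='id', ignore_index=True)

# 3. Sanity checks: expected row count, and the key is now unique
assert len(deduped) == len(df) - n_to_drop
assert deduped['id'].is_unique
```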


## 📊 5. Examples on Real/Pseudo Data

### 🔸 Example 1: Detect & Remove Duplicate Rows

In [2]:
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 2],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob'],
    'email': ['a@mail.com', 'b@mail.com', 'c@mail.com', 'd@mail.com', 'b@mail.com']
})

df

|   | id | name    | email      |
| - | -- | ------- | ---------- |
| 0 | 1  | Alice   | a@mail.com |
| 1 | 2  | Bob     | b@mail.com |
| 2 | 3  | Charlie | c@mail.com |
| 3 | 4  | David   | d@mail.com |
| 4 | 2  | Bob     | b@mail.com |


In [3]:
# Find duplicate rows based on 'id'
df[df.duplicated(subset='id')]

|   | id | name | email      |
| - | -- | ---- | ---------- |
| 4 | 2  | Bob  | b@mail.com |


In [4]:
# Drop duplicates based on 'id', keep the first
df_unique = df.drop_duplicates(subset='id', keep='first')
df_unique

|   | id | name    | email      |
| - | -- | ------- | ---------- |
| 0 | 1  | Alice   | a@mail.com |
| 1 | 2  | Bob     | b@mail.com |
| 2 | 3  | Charlie | c@mail.com |
| 3 | 4  | David   | d@mail.com |


### 🔸 Example 2: Get Unique Values and Frequencies

In [6]:
df['name'].unique()         # ['Alice', 'Bob', 'Charlie', 'David']

array(['Alice', 'Bob', 'Charlie', 'David'], dtype=object)

In [7]:
df['name'].nunique()        # 4

4

In [8]:
df['name'].value_counts()   # Shows counts of each name

name
Bob        2
Alice      1
Charlie    1
David      1
Name: count, dtype: int64

### 🔸 Example 3: Remove Duplicates and Aggregate

In [9]:
df = pd.DataFrame({
    'email': ['x@mail.com', 'x@mail.com', 'y@mail.com'],
    'score': [80, 90, 75]
})

df

|   | email      | score |
| - | ---------- | ----- |
| 0 | x@mail.com | 80    |
| 1 | x@mail.com | 90    |
| 2 | y@mail.com | 75    |


In [11]:
# Keep only the highest score per email
df_clean = df.groupby('email', as_index=False)['score'].max()
df_clean

|   | email      | score |
| - | ---------- | ----- |
| 0 | x@mail.com | 90    |
| 1 | y@mail.com | 75    |
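The `groupby(...).max()` approach returns only the grouped and aggregated columns. When the whole best-scoring row is needed, two possible alternatives are sketched below; the `attempt` column is a hypothetical addition for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['x@mail.com', 'x@mail.com', 'y@mail.com'],
    'score': [80, 90, 75],
    'attempt': ['first', 'second', 'first'],  # hypothetical extra column
})

# Option 1: sort so the highest score comes last per email, then keep the last row
best_rows = (
    df.sort_values('score')
      .drop_duplicates(subset='email', keep='last')
      .sort_index()  # restore the original row order
)

# Option 2: look up the row label of the maximum score in each group
via_idxmax = df.loc[df.groupby('email')['score'].idxmax()]

print(best_rows)
print(via_idxmax)
```

Both options keep the row with score 90 for `x@mail.com`, including its `attempt` value.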


## 🌍 6. Real-World Use Cases

| Scenario                    | Application                                                       |
| --------------------------- | ----------------------------------------------------------------- |
| **Customer database**       | Remove duplicate customers based on email or phone number         |
| **Merging datasets**        | After merging, drop rows with duplicate IDs                       |
| **E-commerce transactions** | Eliminate repeated transactions                                   |
| **Surveys**                 | Remove repeated entries from the same respondent                  |
| **Sensor readings**         | Deduplicate records based on timestamp and sensor ID              |
| **Marketing leads**         | Use `drop_duplicates(subset='email')` to get distinct prospects   |
| **Job portal**              | Remove multiple applications from the same user for the same role |


## ✅ Summary Table

| Task                   | Function            |
| ---------------------- | ------------------- |
| Detect duplicates      | `duplicated()`      |
| Remove duplicates      | `drop_duplicates()` |
| Count unique values    | `nunique()`         |
| List unique values     | `unique()`          |
| Count frequency        | `value_counts()`    |
| Deduplicate with logic | `groupby().agg()`   |

