# LeetCode 196: Delete Duplicate Emails

### Problem Statement

**Table: Person**

| Column Name | Type    |
|-------------|---------|
| id          | int     |
| email       | varchar |

`id` is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.

**Task:**
Write a solution to **delete** all duplicate emails, keeping only one unique email with the **smallest id**.

**Note:**
For SQL users, please note that you are supposed to write a `DELETE` statement and not a `SELECT` one.

**Example 1:**

**Input:**
Person table:
| id | email            |
|----|------------------|
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |

**Output:**
| id | email            |
|----|------------------|
| 1  | john@example.com |
| 2  | bob@example.com  |

**Explanation:** john@example.com is repeated two times. We keep the row with the smallest Id = 1.

In [1]:
import sqlite3
import pandas as pd

# 1. Setup SQLite in-memory database
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# 2. Helper function to display query results as a Pandas DataFrame
def show(query):
    return pd.read_sql_query(query, conn)

print("Environment setup complete. Database ready.")

Environment setup complete. Database ready.


### Schema Description

The schema consists of a single table, `Person`.

This scenario represents a **Data Cleaning** operation. In production, a table like this might be a user registry in which a race condition or a missing unique constraint allowed duplicate registrations.

*   **Primary Key (`id`):** The unique identifier. Crucially, the problem states we must keep the *smallest* ID. This usually implies keeping the *oldest* record (assuming auto-incrementing IDs).
*   **Data Column (`email`):** The content which is duplicated.

In [2]:
# Create the Person table
create_table_sql = """
CREATE TABLE Person (
    id INTEGER PRIMARY KEY,
    email VARCHAR(255)
);
"""

cursor.execute(create_table_sql)
conn.commit()
print("Table 'Person' created successfully.")

Table 'Person' created successfully.


### Sample Data

We will populate the database with the LeetCode example.

**Data Analysis:**
*   **Row 1 (id:1):** `john@example.com` (Keep this - it's the original).
*   **Row 2 (id:2):** `bob@example.com` (Unique - Keep).
*   **Row 3 (id:3):** `john@example.com` (Duplicate of id:1. Delete this).

In [3]:
# Insert sample data
insert_data_sql = """
INSERT INTO Person (id, email) VALUES
(1, 'john@example.com'),
(2, 'bob@example.com'),
(3, 'john@example.com');
"""

cursor.execute(insert_data_sql)
conn.commit()
print("Sample data inserted.")

Sample data inserted.


### 🎓 Lecture: The Logic of Deduplication (DELETE vs. SELECT)

As a senior data professional, keep one asymmetry in mind: writing a `SELECT` statement to find duplicates is safe because it only reads data, while writing a `DELETE` statement based on joins is **high-risk**. If the logic is wrong, you lose data permanently.

#### 1. The "Keep" vs. "Kill" Strategy
There are two mental models for solving this:
1.  **The "Kill" Strategy (Self-Join):** Identify explicitly the rows that are redundant (e.g., "Find rows where the email is the same, but the ID is larger").
2.  **The "Keep" Strategy (Window/Aggregate):** Identify the "Golden Records" (e.g., "Find the minimum ID for every email") and delete everything else.

#### 2. Visualizing the Comparison (Cross Join Logic)
To understand the "Kill" strategy, imagine comparing every row with every other row.

**ASCII Matrix of Comparison:**

    Row A (id:1, john)  vs  Row B (id:3, john)
    ------------------------------------------
    1. Are emails same?  YES (john@example.com)
    2. Is A.id > B.id?   NO  (1 is not > 3)
    -> Conclusion: Row A is the "original". Don't delete.

    Row B (id:3, john)  vs  Row A (id:1, john)
    ------------------------------------------
    1. Are emails same?  YES
    2. Is B.id > A.id?   YES (3 > 1)
    -> Conclusion: Row B is the "duplicate". DELETE IT.
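
The comparison above can be previewed as a read-only self-join before anything is deleted. The following is a minimal sketch that assumes a throwaway in-memory database; `demo_conn` and the column aliases are illustrative names, not part of the original notebook.

```python
import sqlite3

# Throwaway in-memory database so the notebook's main table is untouched.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255))")
demo_conn.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Self-join: pair every row (a) with every row (b) that shares its email,
# and keep only the pairs where a has the larger id.
pairs = demo_conn.execute(
    """
    SELECT a.id AS duplicate_id, b.id AS original_id, a.email
    FROM Person a
    JOIN Person b ON a.email = b.email AND a.id > b.id
    ORDER BY a.id, b.id
    """
).fetchall()

print(pairs)  # [(3, 1, 'john@example.com')]
demo_conn.close()
```

Only the pair (3, 1) survives the `a.id > b.id` filter, which matches the second comparison in the matrix.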

#### 3. SQL Syntax Variations (Important for Interviews)
This specific LeetCode problem is infamous because the solution syntax varies heavily between SQL Dialects.

**A. The MySQL Syntax (LeetCode Default):**
MySQL allows joining tables *inside* the DELETE statement.
```sql
DELETE p1
FROM Person p1, Person p2
WHERE p1.email = p2.email AND p1.id > p2.id;
```
**Explanation:** "Delete from alias p1 if it matches p2 on email, but p1 has the larger ID."

**B. The ANSI SQL Syntax (Postgres, SQLite, Oracle):**
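Standard SQL does not define a multi-table `DELETE`, so the MySQL form above is not portable (Postgres, for instance, has its own `DELETE ... USING` extension instead). The portable approach is a subquery.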
```sql
DELETE FROM Person
WHERE id NOT IN (
    SELECT MIN(id)
    FROM Person
    GROUP BY email
);
```

**Explanation:** "Calculate the list of IDs we want to SAVE (the minimums). Delete everyone else."
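
The "Kill" strategy can also be written without a multi-table `DELETE` by using a correlated `EXISTS` subquery. This variant is not part of the original notebook; it is a sketch, run against a throwaway in-memory SQLite database, and `exists_conn` and the alias `older` are illustrative names.

```python
import sqlite3

exists_conn = sqlite3.connect(":memory:")
exists_conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255))")
exists_conn.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Delete a row only if some other row has the same email and a smaller id.
# The row with the smallest id per email never matches, so it is always kept.
exists_conn.execute(
    """
    DELETE FROM Person
    WHERE EXISTS (
        SELECT 1
        FROM Person AS older
        WHERE older.email = Person.email
          AND older.id < Person.id
    )
    """
)

remaining_after_exists = exists_conn.execute(
    "SELECT id, email FROM Person ORDER BY id"
).fetchall()
print(remaining_after_exists)  # [(1, 'john@example.com'), (2, 'bob@example.com')]
exists_conn.close()
```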


#### 4. Safely Running DELETEs in Production
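In a real job, never run a `DELETE` without checking it first.

*   **Run it as a `SELECT` first:** `SELECT * FROM Person WHERE ...` with the same `WHERE` clause shows exactly which rows will be removed.
*   **Use transactions:** `BEGIN TRANSACTION; DELETE ...; ROLLBACK;` (or `COMMIT` once the result looks right).
*   **Consider soft deletes:** Often we don't physically delete rows at all. Instead, we set a flag column such as `is_deleted = true`.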
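
The preview and transaction habits can be sketched with Python's `sqlite3` module. This example assumes the module's default transaction handling, where a transaction is opened implicitly before a `DELETE` and nothing is permanent until `commit()`. The names `safe_conn` and `KEEPERS` are illustrative.

```python
import sqlite3

safe_conn = sqlite3.connect(":memory:")
safe_conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255))")
safe_conn.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)
safe_conn.commit()

# Subquery listing the ids we intend to keep (smallest id per email).
KEEPERS = "SELECT MIN(id) FROM Person GROUP BY email"

# Step 1: preview the rows the DELETE would remove, using the same WHERE clause.
preview = safe_conn.execute(
    f"SELECT id, email FROM Person WHERE id NOT IN ({KEEPERS}) ORDER BY id"
).fetchall()

# Step 2: run the DELETE inside the implicitly opened transaction.
deleted = safe_conn.execute(
    f"DELETE FROM Person WHERE id NOT IN ({KEEPERS})"
).rowcount
rows_during = safe_conn.execute("SELECT COUNT(*) FROM Person").fetchone()[0]

# Step 3: roll back instead of committing, and confirm the data is restored.
safe_conn.rollback()
rows_after_rollback = safe_conn.execute("SELECT COUNT(*) FROM Person").fetchone()[0]

print(preview)              # [(3, 'john@example.com')]
print(deleted)              # 1
print(rows_during)          # 2
print(rows_after_rollback)  # 3
safe_conn.close()
```

In practice the final step would be `commit()` once `deleted` matches the number of rows in the preview.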

### Step-by-Step Reasoning for the Solution

Since we are running this notebook in **SQLite**, which does not accept the MySQL multi-table `DELETE`, we will use the subquery approach (the "Keep" Strategy). It is also the more portable of the two forms.

**Logical Steps:**

1.  **Identify the Keepers:** We want to group records by `email` and find the `MIN(id)` for each group.
    *   Query: `SELECT MIN(id) FROM Person GROUP BY email`
    *   Result: `[1, 2]` (John's ID 1, Bob's ID 2).
2.  **Identify the Targets:** Any ID that is **NOT IN** this list of keepers must be a duplicate with a higher ID.
3.  **Perform Deletion:** Delete rows where the `id` is not in the "Keeper List".

**Drafting the Query:**

    DELETE FROM Person
    WHERE id NOT IN (
        SELECT MIN(id)
        FROM Person
        GROUP BY email
    );
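
Before running the `DELETE`, steps 1 and 2 can be checked on their own as plain `SELECT` queries. This is a small sketch on a separate in-memory database; `steps_conn` and the alias `keep_id` are illustrative names.

```python
import sqlite3

steps_conn = sqlite3.connect(":memory:")
steps_conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255))")
steps_conn.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Step 1: the keepers, i.e. the smallest id for each email.
keeper_ids = [
    row[0]
    for row in steps_conn.execute(
        "SELECT MIN(id) AS keep_id FROM Person GROUP BY email ORDER BY keep_id"
    )
]

# Step 2: the targets, i.e. every id that is not a keeper.
target_ids = [
    row[0]
    for row in steps_conn.execute(
        """
        SELECT id FROM Person
        WHERE id NOT IN (SELECT MIN(id) FROM Person GROUP BY email)
        ORDER BY id
        """
    )
]

print(keeper_ids)  # [1, 2]
print(target_ids)  # [3]
steps_conn.close()
```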

In [6]:
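# Final SQL Solution
# Note: We execute this with cursor.execute() because it is a DML (Data Manipulation Language)
# statement, not a SELECT query that returns a DataFrame.
# The extra derived table ("temp") is not required by SQLite. It is the usual workaround for MySQL,
# which rejects a subquery that reads directly from the table being deleted from (error 1093).

delete_query = """
DELETE FROM Person
WHERE id NOT IN (
    SELECT id
    FROM (
        SELECT MIN(id) AS id
        FROM Person
        GROUP BY email
    ) AS temp
);
"""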

cursor.execute(delete_query)
conn.commit()
print("Delete operation executed.")

Delete operation executed.


### Output Verification

Since `DELETE` statements do not return a table, we must run a `SELECT` statement afterwards to verify the state of the table.

**Expected State:**
- **John:** ID 1 should exist. ID 3 should be gone.
- **Bob:** ID 2 should exist.

**Expected Output Table:**

| id | email            |
|----|------------------|
| 1  | john@example.com |
| 2  | bob@example.com  |


In [7]:
# Verify the remaining data
verify_query = "SELECT * FROM Person ORDER BY id;"
show(verify_query)

|   | id | email            |
|---|----|------------------|
| 0 | 1  | john@example.com |
| 1 | 2  | bob@example.com  |


### Summary and Key Takeaways

1.  **Destructive Operations:** Always treat `DELETE` with extreme caution. In this notebook, we used the "Keep Strategy" (Keep MIN ID), which is safer because it explicitly defines the "Golden Records" first.
2.  **Syntax Differences:** Be aware that `DELETE p1 FROM Person p1 JOIN Person p2 ...` is specific to MySQL (T-SQL has a similar variant). The `WHERE id NOT IN (...)` pattern works on almost every relational database (SQLite, Postgres, Oracle). MySQL accepts it only when the subquery is wrapped in a derived table, as in the final query above.
3.  **Self-Join Logic:** Even though we used a subquery, the concept is fundamentally a self-comparison: comparing the set of IDs against the set of "Best" IDs for the same email.
4.  **Edge Cases:**
    *   **All Unique:** The `MIN(id)` list includes every row, so `NOT IN` matches nothing and 0 rows are deleted. Correct.
    *   **Triple Duplicate:** If John had IDs 1, 3, and 5, `MIN(id)` is 1. IDs 3 and 5 are not in `[1]`, so both are deleted. Correct.
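
Both edge cases can be checked mechanically. The sketch below uses a hypothetical helper, `dedupe`, which loads rows into a fresh in-memory table, runs the notebook's `DELETE`, and reports how many rows were removed and what remains. The sample emails are made up for illustration.

```python
import sqlite3

DEDUP_SQL = """
DELETE FROM Person
WHERE id NOT IN (
    SELECT MIN(id)
    FROM Person
    GROUP BY email
)
"""

def dedupe(rows):
    """Run the DELETE on a fresh table; return (deleted_count, remaining_rows)."""
    edge_conn = sqlite3.connect(":memory:")
    edge_conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255))")
    edge_conn.executemany("INSERT INTO Person (id, email) VALUES (?, ?)", rows)
    deleted = edge_conn.execute(DEDUP_SQL).rowcount
    remaining = edge_conn.execute("SELECT id, email FROM Person ORDER BY id").fetchall()
    edge_conn.close()
    return deleted, remaining

# All unique: nothing should be deleted.
all_unique = dedupe([(1, "alice@example.com"), (2, "bob@example.com")])

# Triple duplicate: ids 3 and 5 should be deleted, id 1 kept.
triple = dedupe([(1, "john@example.com"), (3, "john@example.com"), (5, "john@example.com")])

print(all_unique)  # (0, [(1, 'alice@example.com'), (2, 'bob@example.com')])
print(triple)      # (2, [(1, 'john@example.com')])
```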