# LeetCode 182: Duplicate Emails

### Problem Statement

**Table: Person**

| Column Name | Type    |
|-------------|---------|
| id          | int     |
| email       | varchar |

`id` is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.

**Task:**
Write a solution to report all the duplicate emails. Note that it's guaranteed that the email field is not NULL. Return the result table in any order.

**Example 1:**

**Input:**
Person table:
| id | email   |
|----|---------|
| 1  | a@b.com |
| 2  | c@d.com |
| 3  | a@b.com |

**Output:**
| Email   |
|---------|
| a@b.com |

**Explanation:** a@b.com is repeated two times.

In [1]:
import sqlite3
import pandas as pd

# 1. Setup SQLite in-memory database
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# 2. Helper function to display query results as a Pandas DataFrame
def show(query):
    return pd.read_sql_query(query, conn)

print("Environment setup complete. Database ready.")

Environment setup complete. Database ready.


### Schema Description

The schema consists of a single table, `Person`.

This is a classic "Flat Table" often found in User Identity systems or Contact lists.

*   **Primary Key (`id`):** This guarantees row uniqueness (technical uniqueness).
*   **Business Key (`email`):** In a perfectly normalized system, `email` might be a `UNIQUE` key. However, the existence of this problem implies that **Business Key constraints are missing or violated**.

In Data Engineering, this represents a **Data Quality (DQ)** issue. We have "Technical Uniqueness" (different IDs) but "Business Duplication" (same person/email).
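
To make the "missing Business Key constraint" point concrete, here is a minimal sketch of what would happen if `email` were declared `UNIQUE`. The table name `PersonStrict` and the connection name `demo` are hypothetical and exist only for this illustration; the notebook's real `Person` table has no such constraint.

```python
import sqlite3

# Hypothetical variant of the schema: same columns, but email is declared UNIQUE.
demo = sqlite3.connect(":memory:")
demo.execute("""
    CREATE TABLE PersonStrict (
        id INTEGER PRIMARY KEY,
        email VARCHAR(255) NOT NULL UNIQUE
    )
""")
demo.execute("INSERT INTO PersonStrict (id, email) VALUES (1, 'a@b.com')")

try:
    # Different id, same email: technically unique, but a business duplicate.
    demo.execute("INSERT INTO PersonStrict (id, email) VALUES (3, 'a@b.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError as exc:
    duplicate_rejected = True
    print("Rejected by the UNIQUE constraint:", exc)

row_count = demo.execute("SELECT COUNT(*) FROM PersonStrict").fetchone()[0]
print("Rows stored:", row_count)
```

With the constraint in place the duplicate never reaches the table, so there is nothing to detect afterwards. The LeetCode problem assumes the opposite situation, which is why a detection query is needed.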

In [2]:
# Create the Person table
create_table_sql = """
CREATE TABLE Person (
    id INTEGER PRIMARY KEY,
    email VARCHAR(255) NOT NULL
);
"""

cursor.execute(create_table_sql)
conn.commit()
print("Table 'Person' created successfully.")

Table 'Person' created successfully.


### Sample Data

We will populate the database with the example data provided in the LeetCode problem.

**Analysis of Data:**
*   **Row 1:** a@b.com (First occurrence)
*   **Row 2:** c@d.com (Unique)
*   **Row 3:** a@b.com (Duplicate of Row 1)

In [3]:
# Insert sample data
insert_data_sql = """
INSERT INTO Person (id, email) VALUES
(1, 'a@b.com'),
(2, 'c@d.com'),
(3, 'a@b.com');
"""

cursor.execute(insert_data_sql)
conn.commit()
print("Sample data inserted.")

Sample data inserted.


### 🎓 Lecture: Aggregation, Grouping, and The Order of Operations

As a Senior Data Professional, identifying duplicates is a routine part of your work on **ETL pipelines** (Extract, Transform, Load). Before loading data into a Production Data Warehouse, you must ensure that business keys such as `email` are actually unique.

To solve this, we rely on the **Split-Apply-Combine** strategy, implemented in SQL via `GROUP BY`.

#### 1. The Anatomy of Aggregation
We cannot check for duplicates by looking at one row at a time. We must look at the **Set** of data.

**The Logic Flow:**
1.  **Split:** Divide the table into buckets based on the `email` value.
2.  **Apply:** Count the number of items in each bucket.
3.  **Combine/Filter:** Return only the buckets where the count is $> 1$.

#### 2. ASCII Visual: The Grouping Process

    STEP 1: Original Table
    +----+---------+
    | id | email   |
    +----+---------+
    | 1  | a@b.com |
    | 2  | c@d.com |
    | 3  | a@b.com |

    STEP 2: GROUP BY email (The "Buckets")
    
    Bucket 'a@b.com':  [ {id:1}, {id:3} ]
    Bucket 'c@d.com':  [ {id:2} ]

    STEP 3: AGGREGATE (Count rows in buckets)
    
    'a@b.com' -> Count: 2
    'c@d.com' -> Count: 1

    STEP 4: FILTER (HAVING Count > 1)
    
    'a@b.com' -> KEEP (2 > 1)
    'c@d.com' -> DROP (1 is not > 1)
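
The four steps in the visual can be mirrored in plain Python. This is only a sketch of the idea behind `GROUP BY ... HAVING`, not a description of how a database engine implements it; the variable names (`rows`, `buckets`, `counts`, `duplicates`) are made up for the example.

```python
from collections import defaultdict

# STEP 1: the original table as (id, email) tuples
rows = [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com")]

# STEP 2: split the rows into buckets keyed by email
buckets = defaultdict(list)
for row_id, email in rows:
    buckets[email].append(row_id)

# STEP 3: apply an aggregate (row count) to each bucket
counts = {email: len(ids) for email, ids in buckets.items()}

# STEP 4: keep only the buckets whose count is greater than 1
duplicates = [email for email, n in counts.items() if n > 1]

print(dict(buckets))
print(counts)
print(duplicates)
```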

#### 3. WHERE vs. HAVING (Crucial Concept)
This is one of the most common interview questions for Junior/Mid SQL roles.

*   **`WHERE` clause:** Filters rows *before* they are grouped.
    *   *Can we use WHERE?* No. We don't know the count until *after* we group.
    *   *Usage:* `WHERE email LIKE '%.com'` (This happens pre-aggregation).
*   **`HAVING` clause:** Filters groups *after* aggregation is calculated.
    *   *Usage:* `HAVING COUNT(id) > 1`.
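
The difference is easy to observe in SQLite. The sketch below builds its own throwaway copy of the sample table (the connection name `demo` is an assumption of this example), tries to filter on an aggregate inside `WHERE`, and then runs a query that uses both clauses in their proper places.

```python
import sqlite3

# Throwaway copy of the sample data so this example runs on its own.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255) NOT NULL)")
demo.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com")],
)

try:
    # An aggregate inside WHERE: the count does not exist yet at this stage.
    demo.execute("SELECT email FROM Person WHERE COUNT(id) > 1 GROUP BY email").fetchall()
    where_rejected = False
except sqlite3.OperationalError as exc:
    where_rejected = True
    print("WHERE with an aggregate fails:", exc)

# WHERE filters rows before grouping; HAVING filters groups after counting.
combined = demo.execute("""
    SELECT email
    FROM Person
    WHERE email LIKE '%.com'
    GROUP BY email
    HAVING COUNT(id) > 1
""").fetchall()
print(combined)
```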

#### 4. Relational Algebra
In theoretical terms, this operation uses the **Grouping Operator** ($\gamma$).

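$$ \pi_{email} ( \sigma_{count > 1} ( \gamma_{email;\ COUNT(id) \rightarrow count} (Person) ) ) $$

1.  $\gamma$ (Gamma): Group by email and compute `COUNT(id)` as `count` for each group.
2.  $\sigma$ (Sigma): Select/Filter where count > 1.
3.  $\pi$ (Pi): Project only the `email` column, matching the expected output.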
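
One way to see the algebra in SQL is to write the grouping and the filter as two separate layers: an inner query for $\gamma$ and an outer `WHERE` for $\sigma$, with the outer `SELECT email` keeping only the email column. This is a sketch on a throwaway copy of the sample data; the alias `cnt` and the names `demo` and `grouped` are assumptions of the example, and the `HAVING` form remains the idiomatic way to write it.

```python
import sqlite3

# Throwaway copy of the sample data so this example runs on its own.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255) NOT NULL)")
demo.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com")],
)

# Gamma: one row per email with its count.
grouped = demo.execute(
    "SELECT email, COUNT(id) AS cnt FROM Person GROUP BY email ORDER BY email"
).fetchall()

# Sigma applied on top of gamma: filter the grouped relation by its count.
filtered = demo.execute("""
    SELECT email
    FROM (SELECT email, COUNT(id) AS cnt FROM Person GROUP BY email) AS grouped
    WHERE cnt > 1
""").fetchall()

print(grouped)
print(filtered)
```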

#### 5. Alternative Approach: Self-Join
While `GROUP BY` is preferred for performance ($O(N \log N)$ with sort-based grouping, or roughly $O(N)$ with hashing), you *could* solve this with a self-join.

```sql
SELECT DISTINCT p1.email
FROM Person p1
JOIN Person p2
  ON p1.email = p2.email  -- Same email
 AND p1.id != p2.id;      -- Different ID (meaning different physical row)
```

This logic says: "Find me a row that matches another row on email, but isn't the same row."

**Why avoid this?**

*   **Performance:** It can create a huge number of intermediate rows if there are many duplicates (Cartesian explosion within the duplicate set).
*   **Readability:** `GROUP BY` explicitly states intent (counting).
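
The size of that intermediate result can be measured directly. In the sketch below, the sample data is extended so that `a@b.com` appears four times (ids 4 and 5 are extra rows invented for this illustration). An email that appears $k$ times pairs with itself $k(k-1)$ times in the join before `DISTINCT` collapses the result.

```python
import sqlite3

# Throwaway table: a@b.com appears 4 times (ids 4 and 5 are hypothetical extras).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255) NOT NULL)")
demo.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com"), (4, "a@b.com"), (5, "a@b.com")],
)

join_condition = "p1.email = p2.email AND p1.id != p2.id"

# Rows produced by the self-join before DISTINCT is applied.
intermediate_rows = demo.execute(
    f"SELECT COUNT(*) FROM Person p1 JOIN Person p2 ON {join_condition}"
).fetchone()[0]

# Final answers from both approaches.
self_join_result = demo.execute(
    f"SELECT DISTINCT p1.email FROM Person p1 JOIN Person p2 ON {join_condition}"
).fetchall()
group_by_result = demo.execute(
    "SELECT email FROM Person GROUP BY email HAVING COUNT(email) > 1"
).fetchall()

print(intermediate_rows)   # 4 copies pair up as 4 * 3 rows
print(self_join_result, group_by_result)
```

Both queries return the same single email, but the join has to produce twelve rows to get there, while the grouped query only has to count four.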

### Step-by-Step Reasoning for the Solution

We need to extract emails appearing more than once.

**Logical Steps:**
1.  **Select Column:** We are interested in `email`.
2.  **Grouping:** We must group the data by `email` to calculate statistics for each unique address.
3.  **Counting:** Inside each group, we count the occurrences. `COUNT(email)` or `COUNT(id)` or `COUNT(*)` all work here since `email` is not NULL.
4.  **Filtering:** We only want groups where the count is strictly greater than 1.
5.  **Ordering:** The problem states "Return the result table in any order," so no `ORDER BY` is needed.

**Drafting the Query:**

    SELECT
        email
    FROM
        Person
    GROUP BY
        email
    HAVING
        COUNT(email) > 1;
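
Step 3 claims that `COUNT(email)`, `COUNT(id)`, and `COUNT(*)` are interchangeable here. A quick sketch to check that on a throwaway copy of the sample data (the names `demo` and `results` are assumptions of this example):

```python
import sqlite3

# Throwaway copy of the sample data so this example runs on its own.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255) NOT NULL)")
demo.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com")],
)

# Run the same query with each counting expression and collect the results.
results = {}
for expr in ("COUNT(email)", "COUNT(id)", "COUNT(*)"):
    query = f"SELECT email FROM Person GROUP BY email HAVING {expr} > 1"
    results[expr] = demo.execute(query).fetchall()

for expr, rows in results.items():
    print(expr, "->", rows)
```

The three variants agree only because neither column can be NULL in this table. `COUNT(column)` skips NULL values, so on a nullable column it could return a smaller number than `COUNT(*)`.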

In [4]:
# Final SQL Solution
final_query = """
SELECT
    email
FROM
    Person
GROUP BY
    email
HAVING
    COUNT(email) > 1;
"""

### Output Verification

Let's trace the execution logic:

1.  **Group: a@b.com**
    *   Rows found: id=1, id=3.
    *   Count: 2.
    *   Condition `2 > 1`: **True**.
    *   Action: Keep.

2.  **Group: c@d.com**
    *   Rows found: id=2.
    *   Count: 1.
    *   Condition `1 > 1`: **False**.
    *   Action: Discard.

**Expected Output:**
| Email   |
|---------|
| a@b.com |

In [5]:
# Execute and show final results
show(final_query)

|   | email   |
|---|---------|
| 0 | a@b.com |


### Summary and Key Takeaways

1.  **The Pattern:** Finding duplicates is almost always solved with `GROUP BY [column] HAVING COUNT([column]) > 1`. Memorize this pattern.
2.  **Execution Order:** Remember that `HAVING` runs after aggregation, while `WHERE` runs before. You cannot filter by a count in a `WHERE` clause.
3.  **Data Quality:** In production, simply finding duplicates isn't enough. You usually need to *remove* them. That would require a more complex query (often using Window Functions like `ROW_NUMBER()`) to keep one instance (e.g., the one with the lowest ID) and delete the rest.
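4.  **Scalability:** Grouping operations can be memory-intensive, because the database typically builds a hash table or sorts the rows. On massive datasets, an index on the grouping column can let the engine read rows already ordered by `email` and avoid much of that work.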
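
As an illustration of takeaway 3, here is one possible way to remove duplicates while keeping the row with the lowest `id` in each group. It is a sketch rather than a production recipe: it runs on a throwaway copy of the sample data (the name `demo` is an assumption), and it assumes an SQLite build with window function support (version 3.25 or later).

```python
import sqlite3

# Throwaway copy of the sample data so this example runs on its own.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, email VARCHAR(255) NOT NULL)")
demo.executemany(
    "INSERT INTO Person (id, email) VALUES (?, ?)",
    [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com")],
)

# Number the rows inside each email group by ascending id.
# rn = 1 is the survivor; every row with rn > 1 is a duplicate to delete.
demo.execute("""
    DELETE FROM Person
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
            FROM Person
        )
        WHERE rn > 1
    )
""")
demo.commit()

remaining = demo.execute("SELECT id, email FROM Person ORDER BY id").fetchall()
leftover_duplicates = demo.execute(
    "SELECT email FROM Person GROUP BY email HAVING COUNT(email) > 1"
).fetchall()

print(remaining)
print(leftover_duplicates)
```

Re-running the detection query afterwards is a cheap sanity check: an empty result means every email now appears exactly once.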