# 🧠 Leetcode 182 — Duplicate Emails (Databricks Edition)

---

## 📘 Problem Statement

### Table: Person

| Column Name | Type    |
|-------------|---------|
| id          | int     |
| email       | varchar |

- `id` is the primary key.
- Each row contains an email.
- Emails will not contain uppercase letters.
- The `email` field is guaranteed to be **not NULL**.

---

## 🎯 Objective

Write a query to report **all duplicate emails**.  
Return the result table in any order.

---

## 🧾 Example

### Input

**Person Table**

| id | email   |
|----|---------|
| 1  | a@b.com |
| 2  | c@d.com |
| 3  | a@b.com |

### Output

| Email   |
|---------|
| a@b.com |

### Explanation

- `a@b.com` appears twice.
- Only emails that appear **more than once** should be returned.
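
The "more than once" rule can be sketched in plain Python before moving to Spark. This is only an illustration: `person_rows` and `duplicate_emails` are hypothetical names, and the list stands in for the example `Person` table above.

```python
from collections import Counter

# Hypothetical in-memory copy of the example Person table: (id, email) pairs.
person_rows = [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com")]


def duplicate_emails(rows):
    """Return the emails that occur more than once, each listed a single time."""
    counts = Counter(email for _, email in rows)
    # Keep an email only when its count exceeds 1; sort for a stable order.
    return sorted(email for email, n in counts.items() if n > 1)


print(duplicate_emails(person_rows))  # ['a@b.com']
```

The count-then-filter shape is the same one the SQL and PySpark solutions use: group by email, count, keep groups with a count above 1.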

---

## 🧱 PySpark DataFrame Creation

```python
from pyspark.sql import Row

# Sample data
person_data = [
    Row(id=1, email="a@b.com"),
    Row(id=2, email="c@d.com"),
    Row(id=3, email="a@b.com")
]

# Create DataFrame
person_df = spark.createDataFrame(person_data)

# Register temp view
person_df.createOrReplaceTempView("Person")
```

---

## ✅ SQL Solution

```sql
SELECT email AS Email
FROM Person
GROUP BY email
HAVING COUNT(*) > 1;
```
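
To try the query without a Spark cluster, a minimal sketch using Python's built-in `sqlite3` module is shown below. SQLite is a stand-in here, not Spark SQL; the query relies only on standard `GROUP BY` and `HAVING`, so it is assumed to behave the same for this case. The helper name `run_duplicate_query` and the extra row with `id = 4` are illustrative additions.

```python
import sqlite3


def run_duplicate_query(rows):
    """Load (id, email) rows into an in-memory table and run the duplicate query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Person (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
        )
        conn.executemany("INSERT INTO Person (id, email) VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT email AS Email FROM Person GROUP BY email HAVING COUNT(*) > 1"
        )
        # Result order is unspecified, so sort before returning.
        return sorted(row[0] for row in cur.fetchall())
    finally:
        conn.close()


# Assumed test data: the example table plus a third occurrence of a@b.com.
sample = [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com"), (4, "a@b.com")]
print(run_duplicate_query(sample))  # ['a@b.com']
```

Because the grouping collapses each email to one row, an email that appears three times is still reported once.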

---

## 🧪 PySpark Solution

```python
from pyspark.sql import functions as F

result_df = person_df.groupBy("email") \
                     .agg(F.count("*").alias("cnt")) \
                     .filter("cnt > 1") \
                     .select(F.col("email").alias("Email"))

result_df.show()
```

---

📘 *This notebook is part of DataGym’s SQL-to-PySpark transition series.*



In [0]:
# Step 1: Create sample data
data = [
    (1, "a@b.com"),
    (2, "c@d.com"),
    (3, "a@b.com")
]

# Step 2: Define schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("email", StringType(), nullable=False)
])

# Step 3: Create DataFrame
df = spark.createDataFrame(data, schema)

# Step 4: Create temporary view
df.createOrReplaceTempView("Person")


In [0]:
from pyspark.sql.functions import count

df.groupBy("email").agg(count("email").alias("cnt")).filter("cnt > 1").selectExpr("email AS Email").display()

Note: the method is `selectExpr`, not `selectexpr`. Python attribute names are case-sensitive, so the all-lowercase spelling raises an `AttributeError`.

In [0]:
df.groupBy("email") \
  .agg(count("email").alias("cnt")) \
  .filter("cnt > 1") \
  .selectExpr("email AS Email") \
  .display()

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Define the window specification
window_spec = Window.partitionBy("email").orderBy("id")

# Add row_number column
df_with_rn = df.withColumn("rn", row_number().over(window_spec))

# Keep only the second row per email, so each duplicated email appears once
# even when it occurs three or more times (rn > 1 would repeat it)
duplicate_emails = df_with_rn.filter("rn = 2").selectExpr("email AS Email")

# Display result
duplicate_emails.display()
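
One subtlety with the row-number approach: a filter of `rn > 1` keeps every occurrence after the first, so an email present three times would be reported twice. Keeping only `rn = 2`, or deduplicating the result, avoids that. The plain-Python sketch below mimics the numbering to show the effect; `extra_occurrences` and the sample rows are hypothetical, and no Spark session is involved.

```python
from itertools import groupby


def extra_occurrences(rows):
    """Mimic row_number() partitioned by email and ordered by id; keep rn > 1."""
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # sort by email, then id
    kept = []
    for email, group in groupby(ordered, key=lambda r: r[1]):
        for rn, _ in enumerate(group, start=1):
            if rn > 1:
                kept.append(email)
    return kept


# Assumed data: a@b.com appears three times.
rows = [(1, "a@b.com"), (2, "c@d.com"), (3, "a@b.com"), (4, "a@b.com")]
raw = extra_occurrences(rows)
print(raw)               # ['a@b.com', 'a@b.com']
print(sorted(set(raw)))  # ['a@b.com']
```

The `GROUP BY ... HAVING` form does not have this problem, which is one reason it is usually the simpler choice for this task.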

In [0]:

# Step 5: SQL query to find duplicate emails
duplicate_emails = spark.sql("""
    SELECT email AS Email
    FROM Person
    GROUP BY email
    HAVING COUNT(*) > 1
""")

# Step 6: Show result
duplicate_emails.show()