# 🧠 Leetcode 175 — Combine Two Tables (Databricks Edition)

---

## 📘 Problem Statement

You are given two tables:

### Table: Person

| Column Name | Type    |
|-------------|---------|
| personId    | int     |
| lastName    | varchar |
| firstName   | varchar |

- `personId` is the primary key.
- This table contains information about the ID of some persons and their first and last names.

---

### Table: Address

| Column Name | Type    |
|-------------|---------|
| addressId   | int     |
| personId    | int     |
| city        | varchar |
| state       | varchar |

- `addressId` is the primary key.
- Each row contains information about the city and state of one person with ID = `personId`.

---

## 🎯 Objective

Write a query to report the `firstName`, `lastName`, `city`, and `state` of each person in the `Person` table.  
If the address of a `personId` is not present in the `Address` table, report `null` instead.

Return the result table in any order.

---

## 🧾 Example

### Input

**Person Table**

| personId | lastName | firstName |
|----------|----------|-----------|
| 1        | Wang     | Allen     |
| 2        | Alice    | Bob       |

**Address Table**

| addressId | personId | city          | state      |
|-----------|----------|---------------|------------|
| 1         | 2        | New York City | New York   |
| 2         | 3        | Leetcode      | California |

### Output

| firstName | lastName | city          | state    |
|-----------|----------|---------------|----------|
| Allen     | Wang     | null          | null     |
| Bob       | Alice    | New York City | New York |

---


```python
from pyspark.sql import Row

# `spark` is the SparkSession that Databricks notebooks provide automatically.

# Sample data
person_data = [
    Row(personId=1, lastName="Wang", firstName="Allen"),
    Row(personId=2, lastName="Alice", firstName="Bob")
]

address_data = [
    Row(addressId=1, personId=2, city="New York City", state="New York"),
    Row(addressId=2, personId=3, city="Leetcode", state="California")
]

# Create DataFrames
person_df = spark.createDataFrame(person_data)
address_df = spark.createDataFrame(address_data)

# Register temp views
person_df.createOrReplaceTempView("Person")
address_df.createOrReplaceTempView("Address")
```

```sql
%sql
SELECT 
    p.firstName,
    p.lastName,
    a.city,
    a.state
FROM Person p
LEFT JOIN Address a
ON p.personId = a.personId;
```
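To check the query against the example without a cluster, the same statement can be run on an in-memory SQLite database. This is only a sketch: SQLite stands in for Spark SQL here, on the assumption that `LEFT JOIN` behaves the same way in both engines for this query. The `ORDER BY` is added just to make the printed order deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp views
# (assumption: LEFT JOIN semantics match Spark SQL for this query).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (personId INTEGER PRIMARY KEY, lastName TEXT, firstName TEXT);
    CREATE TABLE Address (addressId INTEGER PRIMARY KEY, personId INTEGER, city TEXT, state TEXT);
    INSERT INTO Person VALUES (1, 'Wang', 'Allen'), (2, 'Alice', 'Bob');
    INSERT INTO Address VALUES (1, 2, 'New York City', 'New York'),
                               (2, 3, 'Leetcode', 'California');
""")

# Same query as above; ORDER BY only fixes the output order for the check.
rows = conn.execute("""
    SELECT p.firstName, p.lastName, a.city, a.state
    FROM Person p
    LEFT JOIN Address a ON p.personId = a.personId
    ORDER BY p.personId
""").fetchall()
conn.close()

for row in rows:
    print(row)
# ('Allen', 'Wang', None, None)
# ('Bob', 'Alice', 'New York City', 'New York')
```

Allen has no matching `Address` row, so `city` and `state` come back as `None` (SQL `null`), and the address for `personId = 3` is dropped because no such person exists.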

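```python
# Joining on the column name (a string) keeps a single personId column.
result_df = person_df.join(
    address_df,
    on="personId",
    how="left"
).select(
    "firstName", "lastName", "city", "state"
)

# display() is Databricks-specific; use result_df.show() elsewhere.
result_df.display()
```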

# 🧠 PySpark Join Disambiguation — Handling Overlapping Column Names

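When two DataFrames share column names other than the join key, the join itself succeeds, but the result contains duplicate column names. Spark raises an `AnalysisException` (ambiguous reference) as soon as you refer to one of those columns by name, so the overlap has to be resolved explicitly. The options below cover the common cases.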

---

## 🧩 Problem

If both DataFrames have columns named `"id"` and `"name"` and you join on `"id"`, the result holds two `"name"` columns, and Spark cannot tell which one `select("name")` refers to unless you disambiguate.

---

## ✅ Solutions

### 1. 🔧 Use Aliases Before Join

Rename the overlapping columns before the join, for example with `selectExpr()` (or `withColumnRenamed()`), so that every column in the result has a unique name:

```python
df1 = df1.selectExpr("id", "name as name_df1")
df2 = df2.selectExpr("id", "name as name_df2")

joined_df = df1.join(df2, on="id", how="inner")
```
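When the overlapping columns are not known in advance, the rename expressions can be generated from the column lists. The helper below is a hypothetical sketch (the name `suffix_overlaps` and the default suffixes are assumptions, not part of PySpark); it works on plain lists such as `df1.columns`, so it runs without a Spark session.

```python
def suffix_overlaps(left_cols, right_cols, keys,
                    left_suffix="_df1", right_suffix="_df2"):
    """Build selectExpr-style strings that rename overlapping non-key columns.

    Hypothetical helper: column names are backtick-quoted so that names with
    spaces or dots are still treated as single identifiers by Spark SQL.
    """
    overlap = (set(left_cols) & set(right_cols)) - set(keys)

    def exprs(cols, suffix):
        return [
            f"`{c}` as `{c}{suffix}`" if c in overlap else f"`{c}`"
            for c in cols
        ]

    return exprs(left_cols, left_suffix), exprs(right_cols, right_suffix)


left_exprs, right_exprs = suffix_overlaps(
    ["id", "name"], ["id", "name", "age"], keys=["id"]
)
print(left_exprs)   # ['`id`', '`name` as `name_df1`']
print(right_exprs)  # ['`id`', '`name` as `name_df2`', '`age`']
```

With real DataFrames, the call would take `df1.columns` and `df2.columns`, and the results would feed into `df1.selectExpr(*left_exprs).join(df2.selectExpr(*right_exprs), on="id")`.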

---

### 2. 🧹 Drop Duplicate Columns After Join

If you only need one version of the overlapping column:

```python
joined_df = df1.join(df2, on="id", how="inner")
joined_df = joined_df.drop(df2["name"])  # Keep df1's "name"
```

---

### 3. 🎯 Select Specific Columns Post-Join

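Explicitly choose which columns to keep, referencing each one through its parent DataFrame (this example assumes `df2` also has an `age` column):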

```python
joined_df = df1.join(df2, on="id", how="inner") \
               .select(df1["id"], df1["name"], df2["age"])
```

---

### 4. 🧠 Use Aliases for DataFrames

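This is handy for SQL-style joins and readable column references. Qualifying a column with its alias resolves the ambiguity inside the `select`, but the two `name` columns still need distinct output names, otherwise the result has duplicates again:

```python
from pyspark.sql import functions as F

df1_alias = df1.alias("a")
df2_alias = df2.alias("b")

# Alias both "name" columns so the result has unique column names.
joined_df = df1_alias.join(df2_alias, df1_alias["id"] == df2_alias["id"]) \
                     .select("a.id", F.col("a.name").alias("name_a"),
                             F.col("b.name").alias("name_b"), "b.age")
```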

---

## 🧪 Bonus Tip for Debugging

Use `.printSchema()` after the join to check the resulting column names and types. Duplicate names show up as repeated entries:

```python
joined_df.printSchema()
```
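The same check can be automated by counting the names in `joined_df.columns`. The function below is a hypothetical sketch (`duplicate_columns` is not a PySpark API) that works on any list of names. It compares names case-insensitively, on the assumption that Spark runs with its default case-insensitive column resolution.

```python
from collections import Counter


def duplicate_columns(columns):
    """Return the lower-cased column names that occur more than once.

    Hypothetical helper; names are compared case-insensitively, assuming
    Spark's default setting where "Name" and "name" resolve to the same column.
    """
    counts = Counter(name.lower() for name in columns)
    return [name for name, n in counts.items() if n > 1]


# A column list such as joined_df.columns might return after a join on "id".
joined_columns = ["id", "name", "name", "age"]
print(duplicate_columns(joined_columns))  # ['name']
```

An empty result means every column can be referenced by name without ambiguity.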

---

