# 🧠 LeetCode 181 — Employees Earning More Than Their Managers (Databricks Edition)

---

## 📘 Problem Statement

### Table: Employee

| Column Name | Type    |
|-------------|---------|
| id          | int     |
| name        | varchar |
| salary      | int     |
| managerId   | int     |

- `id` is the primary key.
- Each row indicates the ID of an employee, their name, salary, and the ID of their manager.

---

## 🎯 Objective

Write a query to find the employees who earn **more than their managers**.

Return the result table in any order.

---

## 🧾 Example

### Input

**Employee Table**

| id | name  | salary | managerId |
|----|-------|--------|-----------|
| 1  | Joe   | 70000  | 3         |
| 2  | Henry | 80000  | 4         |
| 3  | Sam   | 60000  | Null      |
| 4  | Max   | 90000  | Null      |

### Output

| Employee |
|----------|
| Joe      |

### Explanation

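Joe earns 70000, while his manager (Sam) earns 60000, so he is included.  
Henry earns 80000, but his manager (Max) earns 90000, so he is excluded.  
Sam and Max have no manager (`managerId` is null), so there is no salary to compare against and they never appear in the result.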
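
The comparison above can be traced by hand before involving Spark at all. The following is a minimal plain-Python sketch of the same logic, using the sample rows from the example; the names `by_id` and `earners` are illustrative only and do not appear elsewhere in this notebook.

```python
# Sample rows from the example table
employees = [
    {"id": 1, "name": "Joe", "salary": 70000, "managerId": 3},
    {"id": 2, "name": "Henry", "salary": 80000, "managerId": 4},
    {"id": 3, "name": "Sam", "salary": 60000, "managerId": None},
    {"id": 4, "name": "Max", "salary": 90000, "managerId": None},
]

# Index rows by id so each employee's manager can be looked up directly
by_id = {e["id"]: e for e in employees}

# Keep employees whose manager exists and earns less than they do
earners = [
    e["name"]
    for e in employees
    if e["managerId"] in by_id
    and e["salary"] > by_id[e["managerId"]]["salary"]
]

print(earners)  # ['Joe']
```

Employees without a manager fail the `in by_id` check, which mirrors how an inner join drops rows whose `managerId` is null.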

---

## 🧱 PySpark DataFrame Creation

```python
from pyspark.sql import Row

# Sample data
employee_data = [
    Row(id=1, name="Joe", salary=70000, managerId=3),
    Row(id=2, name="Henry", salary=80000, managerId=4),
    Row(id=3, name="Sam", salary=60000, managerId=None),
    Row(id=4, name="Max", salary=90000, managerId=None)
]

# Create DataFrame
employee_df = spark.createDataFrame(employee_data)

# Register temp view
employee_df.createOrReplaceTempView("Employee")
```

---

## ✅ SQL Solution

```sql
SELECT e.name AS Employee
FROM Employee e
JOIN Employee m
ON e.managerId = m.id
WHERE e.salary > m.salary;
```

---

## 🧪 PySpark Solution

```python
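from pyspark.sql import functions as F

# Alias each side of the self-join so columns can be qualified by name
emp = employee_df.alias("e")
mgr = employee_df.alias("m")

result_df = emp.join(
    mgr,
    # Qualified names ("e.", "m.") keep the two sides of the self-join unambiguous
    F.col("e.managerId") == F.col("m.id"),
    how="inner"
).filter(
    F.col("e.salary") > F.col("m.salary")
).select(
    F.col("e.name").alias("Employee")
)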

result_df.show()
```

---

📘 *This notebook is part of DataGym’s SQL-to-PySpark transition series.*


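---

## 🔁 Alternative PySpark Solution: Rename Columns Before Joining

Prefixing every column with `emp_` or `mgr_` before the join gives each side unique column names, so the join condition needs no aliases.

```python
# Employee side: prefix every column with emp_
df_employee = employee_df.selectExpr(
    "id as emp_id", "name as emp_name", "salary as emp_salary", "managerId as emp_managerId"
)
# Manager side: prefix every column with mgr_
df_manager = employee_df.selectExpr(
    "id as mgr_id", "name as mgr_name", "salary as mgr_salary", "managerId as mgr_managerId"
)

# Join each employee to their manager and keep those who earn more
df_employee.join(
    df_manager,
    df_employee["emp_managerId"] == df_manager["mgr_id"],
    how="inner",
).filter(
    df_employee["emp_salary"] > df_manager["mgr_salary"]
).selectExpr(
    "emp_name as Employee"
).display()  # display() is Databricks-specific; use show() elsewhere
```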



In PySpark, you can reference and manipulate columns in several ways, depending on whether you are selecting, transforming, filtering, or aliasing. This cheat sheet summarizes the **different ways to write and use columns in PySpark**:

---

## 🧠 PySpark Column Syntax Cheat Sheet

### 🔹 1. **Dot Notation**
```python
df.colName
df.salary
```
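- Simple and readable.
- Works only if `colName` is a valid Python identifier (no spaces or special characters).
- Does not work when the name matches an existing DataFrame attribute or method such as `count` or `columns`; the attribute wins.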

---

### 🔹 2. **Bracket Notation**
```python
df["colName"]
df["salary"]
```
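- More flexible.
- Useful when column names have spaces or special characters.
- A dot inside the string is read as struct field access, so a column whose name contains a literal dot must be wrapped in backticks.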

---

### 🔹 3. **`col()` Function**
```python
from pyspark.sql.functions import col

col("salary")
col("employee.name")
```
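- Not tied to a specific DataFrame variable, which makes it convenient inside chained transformations and filters.
- Returns a `Column`, so methods such as `.alias()`, `.cast()`, and `.substr()` chain directly.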

---

### 🔹 4. **`df.selectExpr()` with SQL Expressions**
```python
df.selectExpr("salary * 1.1 as updated_salary", "name")
```
- Powerful for inline calculations and aliasing.
- Accepts SQL-like strings.

---

### 🔹 5. **`df.select()` with Column Objects**
```python
df.select(col("salary").alias("updated_salary"), col("name"))
```
- Explicit and readable.
- Great for chaining transformations.

---

### 🔹 6. **`df.withColumn()` for Adding or Modifying Columns**
```python
df.withColumn("bonus", col("salary") * 0.1)
```
- Adds or replaces a column.
- Often used in pipelines.

---

### 🔹 7. **`df["colName"].alias("newName")`**
```python
df.select(df["salary"].alias("updated_salary"))
```
- Combines bracket notation with aliasing.
- Handy in joins or renaming.

---

### 🔹 8. **SQL via Temp View**
```python
df.createOrReplaceTempView("Employee")
spark.sql("SELECT name, salary FROM Employee WHERE salary > 50000")
```
- Ideal for SQL lovers.
- Column access via SQL syntax.

---

### 🔹 9. **Nested Struct Columns**
```python
df.select("employee.name", "employee.salary")
col("employee.name")
```
- For accessing fields inside structs.
- Pass a dotted path (`"struct.field"`) as a string to `select()` or to `col()`.

---

### 🧪 Bonus: Column Functions
You can apply transformations directly:
```python
col("name").substr(1, 3)
col("salary").cast("double")
col("name").like("J%")
```

---

