# Top Three Salaries  
**FAANG SQL Interview Question**

---

### Question  
As part of an ongoing analysis of salary distribution within the company, your manager has requested a report identifying high earners in each department. A 'high earner' within a department is defined as an employee with a salary ranking among the top three salaries within that department.

You're tasked with identifying these high earners across all departments. Write a query to display each employee's name along with their department name and salary. Sort the results by department name in ascending order, then by salary in descending order. If multiple employees have the same salary, order them alphabetically by name.

> **Note:** Use the appropriate ranking window function so that duplicate salaries are handled correctly.

> As of June 18th, we have removed the requirement for unique salaries and revised the sorting order for the results.
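
To make the note about ranking functions concrete, here is a minimal plain-Python sketch of how `ROW_NUMBER`, `RANK`, and `DENSE_RANK` number a list of salaries that is already sorted in descending order. The helper functions and the sample salaries are illustrative assumptions, not part of the original question; they only mimic the SQL behaviour.

```python
# Plain-Python illustration of three SQL ranking functions.
# The helper names are hypothetical; they only mimic ROW_NUMBER, RANK, DENSE_RANK.

def row_number(values):
    # Every row gets a distinct position, even when values tie.
    return list(range(1, len(values) + 1))

def rank(values):
    # Tied values share a rank; the next rank skips ahead (1, 1, 3, ...).
    ranks = []
    for i, v in enumerate(values):
        if i > 0 and v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return ranks

def dense_rank(values):
    # Tied values share a rank and no rank is skipped (1, 1, 2, ...).
    ranks = []
    for i, v in enumerate(values):
        if i == 0:
            ranks.append(1)
        elif v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks

salaries = [6800, 6800, 5000, 4000, 3000]  # made-up values, sorted descending
print(row_number(salaries))  # [1, 2, 3, 4, 5]
print(rank(salaries))        # [1, 1, 3, 4, 5]
print(dense_rank(salaries))  # [1, 1, 2, 3, 4]
```

With a filter of "rank at most 3", only the dense ranking keeps the 4000 salary, because the tie at 6800 does not consume an extra rank. That matches the definition of a high earner as someone whose salary is among the top three distinct salaries.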

---

### employee Schema:
| column_name   | type     | description                           |
|---------------|----------|---------------------------------------|
| employee_id   | integer  | The unique ID of the employee.        |
| name          | string   | The name of the employee.             |
| salary        | integer  | The salary of the employee.           |
| department_id | integer  | The department ID of the employee.    |
| manager_id    | integer  | The manager ID of the employee.       |

---

### department Schema:
| column_name     | type    | description                      |
|-----------------|---------|----------------------------------|
| department_id   | integer | The unique ID of the department. |
| department_name | string  | The name of the department.      |

---

### employee Example Input:
| employee_id | name              | salary | department_id | manager_id |
|-------------|-------------------|--------|----------------|------------|
| 1           | Emma Thompson     | 3800   | 1              | 6          |
| 2           | Daniel Rodriguez  | 2230   | 1              | 7          |
| 3           | Olivia Smith      | 2000   | 1              | 8          |
| 4           | Noah Johnson      | 6800   | 2              | 9          |
| 5           | Sophia Martinez   | 1750   | 1              | 11         |
| 6           | Liam Brown        | 13000  | 3              | null       |
| 7           | Ava Garcia        | 12500  | 3              | null       |
| 8           | William Davis     | 6800   | 2              | null       |
| 9           | Isabella Wilson   | 11000  | 3              | null       |
| 10          | James Anderson    | 4000   | 1              | 11         |

---

### department Example Input:
| department_id | department_name |
|----------------|------------------|
| 1              | Data Analytics   |
| 2              | Data Science     |

---

### Example Output:
| department_name | name             | salary |
|------------------|------------------|--------|
| Data Analytics   | James Anderson   | 4000   |
| Data Analytics   | Emma Thompson    | 3800   |
| Data Analytics   | Daniel Rodriguez | 2230   |
| Data Science     | Noah Johnson     | 6800   |
| Data Science     | William Davis    | 6800   |

---

### Explanation  
The output displays the high earners in each department.

- In the **Data Analytics** department, James Anderson leads with a salary of $4,000, followed by Emma Thompson earning $3,800, and Daniel Rodriguez with $2,230.  
- In the **Data Science** department, both Noah Johnson and William Davis earn $6,800, with Noah listed before William due to alphabetical ordering.
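
As a cross-check of the explanation above, the expected output can be reproduced in plain Python without any SQL engine. This is only a sketch: the `high_earners` function is a hypothetical helper, and it assumes that employees whose department has no entry in the department table are dropped, as an inner join would do.

```python
from collections import defaultdict

# Rows copied from the example input: (name, salary, department_id).
employees = [
    ("Emma Thompson", 3800, 1),
    ("Daniel Rodriguez", 2230, 1),
    ("Olivia Smith", 2000, 1),
    ("Noah Johnson", 6800, 2),
    ("Sophia Martinez", 1750, 1),
    ("Liam Brown", 13000, 3),
    ("Ava Garcia", 12500, 3),
    ("William Davis", 6800, 2),
    ("Isabella Wilson", 11000, 3),
    ("James Anderson", 4000, 1),
]
departments = {1: "Data Analytics", 2: "Data Science"}

def high_earners(employees, departments, n=3):
    # Collect the distinct salaries of each department that has a name.
    # Employees whose department_id is missing from `departments` are skipped,
    # which mirrors an inner join.
    salaries = defaultdict(set)
    for _, salary, dept_id in employees:
        if dept_id in departments:
            salaries[dept_id].add(salary)

    # The n highest distinct salaries per department.
    top = {d: set(sorted(s, reverse=True)[:n]) for d, s in salaries.items()}

    rows = [
        (departments[dept_id], name, salary)
        for name, salary, dept_id in employees
        if dept_id in departments and salary in top[dept_id]
    ]
    # Department ascending, salary descending, name ascending.
    return sorted(rows, key=lambda r: (r[0], -r[2], r[1]))

result = high_earners(employees, departments)
for row in result:
    print(row)
```

The three employees in department 3 do not appear in the result because the example department table has no row for that ID.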

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col, dense_rank

spark = SparkSession.builder.master('local[1]').getOrCreate()

employee_data = [
    (1, "Emma Thompson", 3800, 1, 6),
    (2, "Daniel Rodriguez", 2230, 1, 7),
    (3, "Olivia Smith", 2000, 1, 8),
    (4, "Noah Johnson", 6800, 2, 9),
    (5, "Sophia Martinez", 1750, 1, 11),
    (6, "Liam Brown", 13000, 3, None),
    (7, "Ava Garcia", 12500, 3, None),
    (8, "William Davis", 6800, 2, None),
    (9, "Isabella Wilson", 11000, 3, None),
    (10, "James Anderson", 4000, 1, 11)
]

employee_schema = StructType([
    StructField("employee_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("department_id", IntegerType(), True),
    StructField("manager_id", IntegerType(), True)
])

employee_df = spark.createDataFrame(employee_data, schema=employee_schema)

department_data = [
    (1, "Data Analytics"),
    (2, "Data Science")
]

department_schema = StructType([
    StructField("department_id", IntegerType(), True),
    StructField("department_name", StringType(), True)
])

department_df = spark.createDataFrame(department_data, schema=department_schema)


In [2]:
employee_df.show()
department_df.show()

+-----------+----------------+------+-------------+----------+
|employee_id|            name|salary|department_id|manager_id|
+-----------+----------------+------+-------------+----------+
|          1|   Emma Thompson|  3800|            1|         6|
|          2|Daniel Rodriguez|  2230|            1|         7|
|          3|    Olivia Smith|  2000|            1|         8|
|          4|    Noah Johnson|  6800|            2|         9|
|          5| Sophia Martinez|  1750|            1|        11|
|          6|      Liam Brown| 13000|            3|      NULL|
|          7|      Ava Garcia| 12500|            3|      NULL|
|          8|   William Davis|  6800|            2|      NULL|
|          9| Isabella Wilson| 11000|            3|      NULL|
|         10|  James Anderson|  4000|            1|        11|
+-----------+----------------+------+-------------+----------+

+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
|            1| Data Analytics|
|            2|   Data Science|
+-------------+---------------+

In [3]:
from pyspark.sql.window import Window

# Rank salaries within each department; dense_rank gives tied salaries the same rank with no gaps.
windspec = Window.partitionBy('department_name').orderBy(col('salary').desc())

# Keep the top three distinct salaries, then sort by department (asc), salary (desc), name (asc).
employee_df.join(department_df, ['department_id'])\
           .withColumn('rnk', dense_rank().over(windspec))\
           .where('rnk <= 3')\
           .orderBy('department_name', 'salary', 'name', ascending=[True, False, True])\
           .select('department_name', 'name', 'salary').show()

+---------------+----------------+------+
|department_name|            name|salary|
+---------------+----------------+------+
| Data Analytics|  James Anderson|  4000|
| Data Analytics|   Emma Thompson|  3800|
| Data Analytics|Daniel Rodriguez|  2230|
|   Data Science|    Noah Johnson|  6800|
|   Data Science|   William Davis|  6800|
+---------------+----------------+------+



In [4]:
employee_df.createOrReplaceTempView('employee')
department_df.createOrReplaceTempView('department')

spark.sql(
'''
WITH cte AS (
    SELECT
        department_name,
        name,
        salary,
        DENSE_RANK() OVER (PARTITION BY department_name ORDER BY salary DESC) AS rnk
    FROM employee
    JOIN department USING (department_id)
)
SELECT department_name, name, salary
FROM cte
WHERE rnk <= 3
ORDER BY department_name, salary DESC, name''').show()

+---------------+----------------+------+
|department_name|            name|salary|
+---------------+----------------+------+
| Data Analytics|  James Anderson|  4000|
| Data Analytics|   Emma Thompson|  3800|
| Data Analytics|Daniel Rodriguez|  2230|
|   Data Science|    Noah Johnson|  6800|
|   Data Science|   William Davis|  6800|
+---------------+----------------+------+
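
To try the same query without a Spark session, the following sketch runs it against an in-memory SQLite database through Python's standard `sqlite3` module. It assumes the bundled SQLite library is version 3.25 or newer, which is when window functions were added; the table definitions are a simplified guess at the schemas above, not something the original notebook provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (employee_id INTEGER, name TEXT, salary INTEGER, "
    "department_id INTEGER, manager_id INTEGER)"
)
conn.execute("CREATE TABLE department (department_id INTEGER, department_name TEXT)")

# Same rows as the example input.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Emma Thompson", 3800, 1, 6),
        (2, "Daniel Rodriguez", 2230, 1, 7),
        (3, "Olivia Smith", 2000, 1, 8),
        (4, "Noah Johnson", 6800, 2, 9),
        (5, "Sophia Martinez", 1750, 1, 11),
        (6, "Liam Brown", 13000, 3, None),
        (7, "Ava Garcia", 12500, 3, None),
        (8, "William Davis", 6800, 2, None),
        (9, "Isabella Wilson", 11000, 3, None),
        (10, "James Anderson", 4000, 1, 11),
    ],
)
conn.executemany(
    "INSERT INTO department VALUES (?, ?)",
    [(1, "Data Analytics"), (2, "Data Science")],
)

query = """
WITH cte AS (
    SELECT
        department_name,
        name,
        salary,
        DENSE_RANK() OVER (PARTITION BY department_name ORDER BY salary DESC) AS rnk
    FROM employee
    JOIN department USING (department_id)
)
SELECT department_name, name, salary
FROM cte
WHERE rnk <= 3
ORDER BY department_name, salary DESC, name
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The rows come back as tuples in the same order as the Spark SQL output above.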

