# [1978. Employees Whose Manager Left the Company](https://leetcode.com/problems/employees-whose-manager-left-the-company/description/?envType=study-plan-v2&envId=top-sql-50)

Table: Employees

<pre>+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| employee_id | int      |
| name        | varchar  |
| manager_id  | int      |
| salary      | int      |
+-------------+----------+</pre>
In SQL, employee_id is the primary key for this table.
This table contains information about the employees, their salary, and the ID of their manager. Some employees do not have a manager (manager_id is null).


Find the IDs of the employees whose salary is strictly less than $30000 and whose manager left the company. When a manager leaves the company, their information is deleted from the Employees table, but their reports still have manager_id set to the ID of the manager who left.

Return the result table ordered by employee_id.

The result format is in the following example.



Example 1:

Input:  
Employees table:
<pre>+-------------+-----------+------------+--------+
| employee_id | name      | manager_id | salary |
+-------------+-----------+------------+--------+
| 3           | Mila      | 9          | 60301  |
| 12          | Antonella | null       | 31000  |
| 13          | Emery     | null       | 67084  |
| 1           | Kalel     | 11         | 21241  |
| 9           | Mikaela   | null       | 50937  |
| 11          | Joziah    | 6          | 28485  |
+-------------+-----------+------------+--------+</pre>
Output:
<pre>+-------------+
| employee_id |
+-------------+
| 11          |
+-------------+</pre>

Explanation:
The employees with a salary less than $30000 are 1 (Kalel) and 11 (Joziah).
Kalel's manager is employee 11, who is still in the company (Joziah).
Joziah's manager is employee 6, who has left the company: there is no row for employee 6 because it was deleted.
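
The rule in the explanation can be sketched in plain Python before turning to pandas or Spark. This is a minimal illustration and not part of the original notebook: the helper name `orphaned_low_earners`, its `salary_cap` parameter, and the tuple layout are assumptions, and `None` stands in for a null manager_id.

```python
# Rows mirror the example table: (employee_id, name, manager_id, salary).
# None stands in for a null manager_id.
rows = [
    (3, 'Mila', 9, 60301),
    (12, 'Antonella', None, 31000),
    (13, 'Emery', None, 67084),
    (1, 'Kalel', 11, 21241),
    (9, 'Mikaela', None, 50937),
    (11, 'Joziah', 6, 28485),
]


def orphaned_low_earners(rows, salary_cap=30000):
    """Hypothetical helper: IDs of employees paid less than salary_cap
    whose manager_id is set but has no matching row."""
    present_ids = {employee_id for employee_id, _, _, _ in rows}
    return sorted(
        employee_id
        for employee_id, _, manager_id, salary in rows
        if salary < salary_cap
        and manager_id is not None         # employees without a manager do not count
        and manager_id not in present_ids  # the manager's row was deleted
    )


print(orphaned_low_earners(rows))  # [11]
```

The explicit `is not None` check matters: an employee with no manager has not lost one, so they should not appear in the result even though their manager_id matches no row.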

In [1]:
# pandas schema
import pandas as pd

data = [[3, 'Mila', 9, 60301], [12, 'Antonella', None, 31000], [13, 'Emery', None, 67084], [1, 'Kalel', 11, 21241],
        [9, 'Mikaela', None, 50937], [11, 'Joziah', 6, 28485]]
employees = pd.DataFrame(data, columns=['employee_id', 'name', 'manager_id', 'salary']).astype(
    {'employee_id': 'Int64', 'name': 'object', 'manager_id': 'Int64', 'salary': 'Int64'})

# Spark has trouble with nulls in pandas Int64 columns, so manager_id is converted to str for now
# (nulls become the string '<NA>') and cast back to int after loading into Spark.
employees['manager_id'] = employees['manager_id'].astype('str')


# pyspark schema

from pyspark.sql import SparkSession
from pyspark.sql.functions import col  # only col is used below; avoids shadowing built-ins

spark = SparkSession.builder.getOrCreate()

employees_df = spark.createDataFrame(employees)
# Cast manager_id back to int and keep the result; the '<NA>' strings become NULL.
employees_df = employees_df.withColumn('manager_id', col('manager_id').cast('int'))
employees_df.show()

+-----------+---------+----------+------+
|employee_id|     name|manager_id|salary|
+-----------+---------+----------+------+
|          3|     Mila|         9| 60301|
|         12|Antonella|      NULL| 31000|
|         13|    Emery|      NULL| 67084|
|          1|    Kalel|        11| 21241|
|          9|  Mikaela|      NULL| 50937|
|         11|   Joziah|         6| 28485|
+-----------+---------+----------+------+



In [2]:
employees_df.printSchema()

root
 |-- employee_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- manager_id: integer (nullable = true)
 |-- salary: long (nullable = true)



In [16]:
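# Solving in pyspark dataframe
# Collect the IDs of everyone still in the table to the driver (fine for a small table).
employee_ids = [row.employee_id for row in employees_df.select('employee_id').collect()]
# Alternative: employees_df.select('employee_id').rdd.map(lambda x: x[0]).collect()

# isin() evaluates to null for a null manager_id, so employees without a manager are filtered out.
employees_df\
    .where( (col('salary') < 30000) & (~col('manager_id').isin(employee_ids)) )\
    .select('employee_id')\
    .orderBy('employee_id')\
    .show()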

+-----------+
|employee_id|
+-----------+
|         11|
+-----------+



In [35]:
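# Solving in pyspark dataframe using anti join
# An anti join keeps every row with no match, including rows whose manager_id is null,
# so employees without a manager are dropped first. The cast maps a non-numeric
# placeholder such as '<NA>' to null and is a no-op if the column is already an int.

employees_df.alias('a')\
    .filter(col('salary') < 30000)\
    .filter(col('manager_id').cast('int').isNotNull())\
    .join(employees_df.alias('b'), col('b.employee_id') == col('a.manager_id'), 'anti')\
    .select('a.employee_id')\
    .orderBy('a.employee_id')\
    .show()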

+-----------+
|employee_id|
+-----------+
|         11|
+-----------+



In [3]:
# In Spark SQL

employees_df.createOrReplaceTempView('employees')

spark.sql('''
SELECT employee_id
FROM employees
WHERE salary < 30000
AND manager_id NOT IN (SELECT employee_id FROM employees)
ORDER BY employee_id
''').show()

+-----------+
|employee_id|
+-----------+
|         11|
+-----------+
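
The `NOT IN` query also has to leave out employees who never had a manager. Under standard SQL three-valued logic, `NULL NOT IN (...)` against a non-empty list evaluates to NULL rather than true, so such rows are dropped by the `WHERE` clause. The sketch below checks this with Python's built-in `sqlite3` module instead of Spark, purely so that it runs without a Spark session. Spark SQL is expected to follow the same rule here, but that is an assumption worth verifying on your version. Employee 20 is a hypothetical extra row that is not in the original example.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp view (an assumption for illustration).
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE employees ('
    'employee_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER, salary INTEGER)'
)
conn.executemany(
    'INSERT INTO employees VALUES (?, ?, ?, ?)',
    [
        (3, 'Mila', 9, 60301),
        (12, 'Antonella', None, 31000),
        (13, 'Emery', None, 67084),
        (1, 'Kalel', 11, 21241),
        (9, 'Mikaela', None, 50937),
        (11, 'Joziah', 6, 28485),
        # Hypothetical row, not in the original data: low salary and no manager.
        (20, 'Hypothetical', None, 10000),
    ],
)

query = '''
SELECT employee_id
FROM employees
WHERE salary < 30000
  AND manager_id NOT IN (SELECT employee_id FROM employees)
ORDER BY employee_id
'''
result = [row[0] for row in conn.execute(query)]
print(result)  # [11]
conn.close()
```

Employee 20 does not appear in the result even though no row matches its manager_id, because the comparison yields NULL. The flip side is that `NOT IN` returns no rows at all if the subquery itself produces a NULL; that cannot happen here because employee_id is the primary key.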



In [None]:
spark.stop()