## Join Syntax
The PySpark `join()` method is called directly on a DataFrame and has the following syntax.

    join(self, other, on=None, how=None)

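* `other`: the DataFrame on the right side of the join.
* `on`: a column name (string), a list of column names, a join expression (Column), or a list of Columns.
* `how`: the join type, default `inner`. Must be one of `inner`, `cross`, `outer`, `full`, `fullouter`, `full_outer`, `left`, `leftouter`, `left_outer`, `right`, `rightouter`, `right_outer`, `semi`, `leftsemi`, `left_semi`, `anti`, `leftanti`, and `left_anti`.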

`join()` takes the parameters listed above and returns a new DataFrame.
You can also narrow the result by chaining `where()` or `filter()` on the joined DataFrame, and you can join on multiple columns.


### PySpark Join Types

| Join String | Equivalent SQL Join |
| :- | -: |
| inner | INNER JOIN |
| outer, full, fullouter, full_outer | FULL OUTER JOIN |
| left, leftouter, left_outer | LEFT JOIN |
| right, rightouter, right_outer | RIGHT JOIN |
| cross | CROSS JOIN |
| anti, leftanti, left_anti | LEFT ANTI JOIN |
| semi, leftsemi, left_semi | LEFT SEMI JOIN |
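
The join strings in the table differ mainly in which unmatched rows they keep. As a rough illustration, here is a plain-Python model (not Spark code) of `inner`, `left`, `right`, and `full` on a key column. The `join` helper and the reduced `(emp_id, emp_dept_id)` and `(dept_name, dept_id)` tuples are assumptions made for this sketch, with keys simplified to integers; `None` stands in for null.

```python
# Plain-Python model of four join types (an illustration, not the Spark API).
emp = [(1, 10), (2, 20), (3, 10), (4, 10), (5, 40), (6, 50)]          # (emp_id, emp_dept_id)
dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]  # (dept_name, dept_id)


def join(left, right, how="inner"):
    """Join on left[1] == right[1]; None plays the role of a null side."""
    matched = [(l, r) for l in left for r in right if l[1] == r[1]]
    if how == "inner":
        return matched
    rows = list(matched)
    if how in ("left", "full"):
        # Keep left rows that found no partner on the right.
        right_keys = {r[1] for r in right}
        rows += [(l, None) for l in left if l[1] not in right_keys]
    if how in ("right", "full"):
        # Keep right rows that found no partner on the left.
        left_keys = {l[1] for l in left}
        rows += [(None, r) for r in right if r[1] not in left_keys]
    return rows


counts = {how: len(join(emp, dept, how)) for how in ("inner", "left", "right", "full")}
print(counts)  # {'inner': 5, 'left': 6, 'right': 6, 'full': 7}
```

These row counts line up with the number of rows in the PySpark outputs shown in the sections that follow.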

In [21]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from termcolor import cprint 

spark = SparkSession.builder.appName('join').getOrCreate()

Let's create the `emp` and `dept` DataFrames. The `emp_id` column is unique in `emp`, `dept_id` is unique in `dept`, and `emp_dept_id` in `emp` references `dept_id` in `dept`.

In [22]:
emp = [(1,"Smith",-1,"2018","10","M",3000), \
       (2,"Rose",1,"2010","20","M",4000), \
       (3,"Williams",1,"2010","10","M",1000), \
       (4,"Jones",2,"2005","10","F",2000), \
       (5,"Brown",2,"2010","40","",-1), \
       (6,"Brown",2,"2010","50","",-1) \
     ]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
              "emp_dept_id","gender","salary"]

empDF = spark.createDataFrame(data=emp, schema = empColumns)
empDF.printSchema()
cprint("--- Emp Dataset", "blue")
empDF.show(truncate=False)

dept = [("Finance",10), \
    ("Marketing",20), \
    ("Sales",30), \
    ("IT",40) \
  ]
deptColumns = ["dept_name","dept_id"]
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
cprint("--- Dept Dataset", "blue")
deptDF.show(truncate=False)

root
 |-- emp_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- superior_emp_id: long (nullable = true)
 |-- year_joined: string (nullable = true)
 |-- emp_dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

--- Emp Dataset
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

### Inner Join DataFrame
An `inner` join is the default and most commonly used join in PySpark. It joins two datasets on key columns and drops rows from both datasets (`emp` and `dept`) whose keys have no match.

Applying an inner join to our datasets drops the row with `emp_dept_id` 50 from `emp` and the row with `dept_id` 30 from `dept`.

In [23]:
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"inner") \
     .show(truncate=False)

+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

### Full Outer Join
An `outer` join (also called `full`, `fullouter`, or `full_outer`) returns all rows from both datasets. Where the join expression doesn't match, it returns null in the columns of the side that has no matching record.

In our data, `emp_dept_id` 50 in `emp` has no matching record in `dept`, so the `dept` columns are null for that row; `dept_id` 30 has no matching record in `emp`, so the `emp` columns are null for that row. The join expressions below produce this result.


In [24]:
cprint('--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"outer")', "red")
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"outer").show(truncate=False)
cprint('--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"full")', "red")
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"full").show(truncate=False)
cprint('--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"fullouter")', "red")
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"fullouter").show(truncate=False)

--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"outer")
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|NULL  |NULL    |NULL           |NULL       |NULL       |NULL  |NULL  |Sales    |30     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|6     |Brown   |2              |2010       |50         |      |-1    |NULL     |NULL   |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"full")
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|NULL  |NULL    |NULL           |NULL       |NULL       |NULL  |NULL  |Sales    |30     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|6     |Brown   |2              |2010       |50         |      |-1    |NULL     |NULL   |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

### Left Outer Join
A `left` join (also called `leftouter` or `left_outer`) returns all rows from the left dataset regardless of whether a match is found in the right dataset. Where the join expression doesn't match, it assigns null to the right-side columns of that record, and it drops right-side records that have no match.

In our data, `emp_dept_id` 50 has no record in `dept`, so this record has null in the `dept` columns (`dept_name` and `dept_id`), and `dept_id` 30 from `dept` is dropped from the result. The join expressions below produce this result.

In [25]:
cprint('--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"left")', "red")
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"left").show(truncate=False)
cprint('--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"leftouter")', "red")
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"leftouter").show(truncate=False)

--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"left")
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|6     |Brown   |2              |2010       |50         |      |-1    |NULL     |NULL   |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"leftouter")
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|6     |Brown   |2              |2010       |50         |      |-1    |NULL     |NULL   |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+



### Right Outer Join
A `right` join (also called `rightouter` or `right_outer`) is the opposite of a `left` join: it returns all rows from the right dataset regardless of whether a match is found in the left dataset. Where the join expression doesn't match, it assigns null to the left-side columns of that record, and it drops left-side records that have no match.

Here, `dept_id` 30 in the right dataset has no match in the left dataset `emp`, so this record has null in the `emp` columns, and `emp_dept_id` 50 is dropped because it has no match on the right. The join expressions below produce this result.

In [26]:
cprint('--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"right")', "red")
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"right").show(truncate=False)
cprint('--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"rightouter")', "red")
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"rightouter").show(truncate=False)

--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"right")
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|NULL  |NULL    |NULL           |NULL       |NULL       |NULL  |NULL  |Sales    |30     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

--- empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"rightouter")

### Left Semi Join
A `leftsemi` join is similar to an `inner` join, except that it returns only the columns of the left dataset and ignores all columns of the right dataset. In other words, it returns the left-side columns of the records that have a match in the right dataset on the join expression; records without a match are dropped.

The same result can be obtained by selecting the left-side columns from the result of an inner join (provided the join key is unique on the right side), but a `leftsemi` join is usually more efficient.
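
The equivalence between a left semi join and "inner join, then select the left columns" depends on the right-side key being unique. The following plain-Python model (not Spark code) sketches why. The helper names and the `dept` rows with a duplicated key 10 are hypothetical and exist only for this illustration.

```python
# Plain-Python model (an illustration, not the Spark API).
emp = [(1, "Smith", 10), (2, "Rose", 20), (6, "Brown", 50)]   # (emp_id, name, emp_dept_id)
# Hypothetical dept rows: key 10 appears twice to expose the difference.
dept = [("Finance", 10), ("Finance-EU", 10), ("Marketing", 20)]


def left_semi(left, right):
    """Keep each left row once if at least one right row has the same key."""
    right_keys = {r[1] for r in right}
    return [l for l in left if l[2] in right_keys]


def inner_then_select_left(left, right):
    """Inner join on the key, then keep only the left-side columns."""
    return [l for l in left for r in right if l[2] == r[1]]


semi = left_semi(emp, dept)
inner_left = inner_then_select_left(emp, dept)
print(semi)        # Smith and Rose, once each
print(inner_left)  # Smith appears twice because key 10 matches two dept rows
```

With a unique key on the right, as in the `dept` data of this tutorial, both functions return the same rows.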

In [27]:
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"leftsemi").show(truncate=False)

+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+



### Left Anti Join
A `leftanti` join does the exact opposite of `leftsemi`: it returns only the left-dataset columns of the records that have no match in the right dataset.
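
One way to picture the relationship: for a given join key, the semi join and the anti join split the left dataset into two disjoint parts. A minimal plain-Python sketch (not Spark code), assuming the keys of this tutorial simplified to integers:

```python
# Plain-Python model (an illustration, not the Spark API).
emp_dept_ids = {1: 10, 2: 20, 3: 10, 4: 10, 5: 40, 6: 50}   # emp_id -> emp_dept_id
dept_ids = {10, 20, 30, 40}

# Semi join: employees whose department exists. Anti join: the rest.
semi = [emp_id for emp_id, dept_id in emp_dept_ids.items() if dept_id in dept_ids]
anti = [emp_id for emp_id, dept_id in emp_dept_ids.items() if dept_id not in dept_ids]

print(semi)  # [1, 2, 3, 4, 5]
print(anti)  # [6]
```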



In [28]:
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"leftanti").show(truncate=False)

+------+-----+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+-----+---------------+-----------+-----------+------+------+
|6     |Brown|2              |2010       |50         |      |-1    |
+------+-----+---------------+-----------+-----------+------+------+

### Self Join
No discussion of joins is complete without a self join. Although there is no dedicated self-join type, any of the join types explained above can be used to join a DataFrame to itself. The example below uses an `inner` self join.

Here we join the `emp` dataset with itself to find the superior's `emp_id` and `name` for every employee who has one.

In [29]:
empDF.alias("emp1").join(empDF.alias("emp2"), col("emp1.superior_emp_id") == col("emp2.emp_id"),"inner") \
    .select(col("emp1.emp_id"),col("emp1.name"), col("emp2.emp_id").alias("superior_emp_id"), \
            col("emp2.name").alias("superior_emp_name")) \
    .show(truncate=False)

+------+--------+---------------+-----------------+
|emp_id|name    |superior_emp_id|superior_emp_name|
+------+--------+---------------+-----------------+
|2     |Rose    |1              |Smith            |
|3     |Williams|1              |Smith            |
|4     |Jones   |2              |Rose             |
|5     |Brown   |2              |Rose             |
|6     |Brown   |2              |Rose             |
+------+--------+---------------+-----------------+
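
Smith is missing from the result because `superior_emp_id` -1 matches no `emp_id`, and an inner join drops unmatched rows. The lookup can be modelled in plain Python (not Spark code) as a dictionary access; the `emp` dictionary below is a reduced, assumed form of the tutorial data.

```python
# Plain-Python model of the inner self join (an illustration, not the Spark API).
emp = {
    1: ("Smith", -1),      # emp_id -> (name, superior_emp_id)
    2: ("Rose", 1),
    3: ("Williams", 1),
    4: ("Jones", 2),
    5: ("Brown", 2),
    6: ("Brown", 2),
}

# Keep only employees whose superior_emp_id exists as an emp_id.
inner = [
    (emp_id, name, sup_id, emp[sup_id][0])
    for emp_id, (name, sup_id) in emp.items()
    if sup_id in emp
]
for row in inner:
    print(row)
```

To keep employees without a superior as well, a `left` self join would be the natural choice, at the cost of nulls in the superior columns.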



### Using SQL Expression
Since PySpark SQL supports native SQL syntax, we can also write joins in SQL: register the DataFrames as temporary views and query those views with `spark.sql()`.

In [30]:
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

# show() returns None, so keep the DataFrame and display it in a separate step
joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
joinDF.show(truncate=False)

joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF2.show(truncate=False)

+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
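
The two queries differ only in syntax: a comma-separated `FROM` with a `WHERE` condition and an explicit `INNER JOIN ... ON` describe the same inner join. As a hedged illustration that runs without a Spark session, the sketch below uses Python's built-in `sqlite3` as a stand-in SQL engine. SQLite is not Spark SQL, and the reduced table definitions are assumptions for this example.

```python
import sqlite3

# SQLite stands in for Spark SQL here; only the join syntax is being compared.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (emp_id INTEGER, name TEXT, emp_dept_id INTEGER)")
conn.execute("CREATE TABLE DEPT (dept_name TEXT, dept_id INTEGER)")
conn.executemany(
    "INSERT INTO EMP VALUES (?, ?, ?)",
    [(1, "Smith", 10), (2, "Rose", 20), (3, "Williams", 10),
     (4, "Jones", 10), (5, "Brown", 40), (6, "Brown", 50)],
)
conn.executemany(
    "INSERT INTO DEPT VALUES (?, ?)",
    [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)],
)

# Implicit join: comma-separated tables plus a WHERE condition.
implicit = conn.execute(
    "SELECT e.emp_id, e.name, d.dept_name FROM EMP e, DEPT d "
    "WHERE e.emp_dept_id = d.dept_id ORDER BY e.emp_id"
).fetchall()

# Explicit join: INNER JOIN ... ON.
explicit = conn.execute(
    "SELECT e.emp_id, e.name, d.dept_name FROM EMP e "
    "INNER JOIN DEPT d ON e.emp_dept_id = d.dept_id ORDER BY e.emp_id"
).fetchall()

print(implicit == explicit)  # True
print(len(implicit))         # 5
conn.close()
```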


### SQL Join on multiple DataFrames
When you need to join more than two tables, you can either use a SQL expression after creating temporary views on the DataFrames, or chain the joins by joining the result of one join operation with another DataFrame.
    
    df1.join(df2,df1.id1 == df2.id2,"inner").join(df3,df1.id1 == df3.id3,"inner")
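
Chaining means the second join operates on the output of the first. A plain-Python model (not Spark code) of two chained inner joins might look like the following; the `inner_join` helper and the contents of `df1`, `df2`, and `df3` are hypothetical, with only the `id1`, `id2`, and `id3` column names taken from the snippet above.

```python
# Plain-Python model of chained inner joins (an illustration, not the Spark API).
def inner_join(left, right, left_key, right_key):
    """Return merged dict rows where left[left_key] == right[right_key]."""
    return [{**l, **r} for l in left for r in right if l[left_key] == r[right_key]]


# Hypothetical rows standing in for df1, df2, and df3.
df1 = [{"id1": 1, "a": "x"}, {"id1": 2, "a": "y"}, {"id1": 3, "a": "z"}]
df2 = [{"id2": 1, "b": "p"}, {"id2": 2, "b": "q"}]
df3 = [{"id3": 2, "c": "m"}, {"id3": 3, "c": "n"}]

step1 = inner_join(df1, df2, "id1", "id2")      # ids 1 and 2 survive
result = inner_join(step1, df3, "id1", "id3")   # only id 2 is present in all three
print(result)
```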

### Finding difference between dataframes

In [31]:
current = [ (1,"Smith",-1,"2018","10","M",3000), \
            (2,"Rose",1,"2010","20","M",4000), \
            (3,"Williams",1,"2010","10","M",1000), \
            (4,"Jones",2,"2005","10","F",2000), \
            (5,"Brown",2,"2010","40","",-1), \
            (6,"Brown",2,"2010","50","",-1) \
            ]
currentColumns = ["id","name","superiorid","year_joined", "dept_id","gender","salary"]

currentDF = spark.createDataFrame(data=current, schema = currentColumns)
currentDF.printSchema()
cprint("--- currentDF Dataset", "blue")
currentDF.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- superiorid: long (nullable = true)
 |-- year_joined: string (nullable = true)
 |-- dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

--- currentDF Dataset
+---+--------+----------+-----------+-------+------+------+
|id |name    |superiorid|year_joined|dept_id|gender|salary|
+---+--------+----------+-----------+-------+------+------+
|1  |Smith   |-1        |2018       |10     |M     |3000  |
|2  |Rose    |1         |2010       |20     |M     |4000  |
|3  |Williams|1         |2010       |10     |M     |1000  |
|4  |Jones   |2         |2005       |10     |F     |2000  |
|5  |Brown   |2         |2010       |40     |      |-1    |
|6  |Brown   |2         |2010       |50     |      |-1    |
+---+--------+----------+-----------+-------+------+------+



In [32]:
previous = [(1,"Smith",-1,"2018","10","M",3000), \
            (2,"Rose",1,"2010","20","M",4000), \
            (3,"Will",1,"2010","10","M",1000), \
            (4,"Jones",2,"2005","10","F",2500), \
            (5,"Brown",2,"2010","40","",-1), \
            (6,"Brown",2,"2010","50","",-1) \
            ]
previousColumns = ["id","name","superiorid","year_joined", "dept_id","gender","salary"]

previousDF = spark.createDataFrame(data=previous, schema = previousColumns)
previousDF.printSchema()
cprint("--- previousDF Dataset", "red")
previousDF.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- superiorid: long (nullable = true)
 |-- year_joined: string (nullable = true)
 |-- dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

--- previousDF Dataset
+---+-----+----------+-----------+-------+------+------+
|id |name |superiorid|year_joined|dept_id|gender|salary|
+---+-----+----------+-----------+-------+------+------+
|1  |Smith|-1        |2018       |10     |M     |3000  |
|2  |Rose |1         |2010       |20     |M     |4000  |
|3  |Will |1         |2010       |10     |M     |1000  |
|4  |Jones|2         |2005       |10     |F     |2500  |
|5  |Brown|2         |2010       |40     |      |-1    |
|6  |Brown|2         |2010       |50     |      |-1    |
+---+-----+----------+-----------+-------+------+------+



In [33]:
df_test = currentDF.subtract(previousDF)
df_test.show(truncate=False)

+---+--------+----------+-----------+-------+------+------+
|id |name    |superiorid|year_joined|dept_id|gender|salary|
+---+--------+----------+-----------+-------+------+------+
|3  |Williams|1         |2010       |10     |M     |1000  |
|4  |Jones   |2         |2005       |10     |F     |2000  |
+---+--------+----------+-----------+-------+------+------+



In [34]:
df_test2 = previousDF.subtract(currentDF)
df_test2.show(truncate=False)

+---+-----+----------+-----------+-------+------+------+
|id |name |superiorid|year_joined|dept_id|gender|salary|
+---+-----+----------+-----------+-------+------+------+
|4  |Jones|2         |2005       |10     |F     |2500  |
|3  |Will |1         |2010       |10     |M     |1000  |
+---+-----+----------+-----------+-------+------+------+



In [35]:
df_test3 = previousDF.join(currentDF, on='name', how='left_anti')
df_test3.show(truncate=False)

+----+---+----------+-----------+-------+------+------+
|name|id |superiorid|year_joined|dept_id|gender|salary|
+----+---+----------+-----------+-------+------+------+
|Will|3  |1         |2010       |10     |M     |1000  |
+----+---+----------+-----------+-------+------+------+

In [36]:
df_test4 = previousDF.join(currentDF, on='salary', how='left_anti')
df_test4.show(truncate=False)

+------+---+-----+----------+-----------+-------+------+
|salary|id |name |superiorid|year_joined|dept_id|gender|
+------+---+-----+----------+-----------+-------+------+
|2500  |4  |Jones|2         |2005       |10     |F     |
+------+---+-----+----------+-----------+-------+------+
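
A `left_anti` join on a single column only asks whether that column's value occurs anywhere in the other DataFrame. The plain-Python model below (not Spark code) sketches this with a hypothetical `left_anti` helper and three rows reduced from `previous` and `current`.

```python
# Plain-Python model of a left anti join on one column (not the Spark API).
previous = [(3, "Will", 1000), (4, "Jones", 2500), (1, "Smith", 3000)]   # (id, name, salary)
current = [(3, "Williams", 1000), (4, "Jones", 2000), (1, "Smith", 3000)]


def left_anti(left, right, col):
    """Left rows whose value at index `col` never occurs in `right`."""
    right_values = {r[col] for r in right}
    return [l for l in left if l[col] not in right_values]


by_name = left_anti(previous, current, col=1)
by_salary = left_anti(previous, current, col=2)
print(by_name)    # [(3, 'Will', 1000)]
print(by_salary)  # [(4, 'Jones', 2500)]
```

Because only membership is checked, a changed value that happens to equal some other row's value would go unnoticed; comparing whole rows avoids that.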



In [37]:
df_test5 = previousDF.exceptAll(currentDF)
df_test5.show(truncate=False)

+---+-----+----------+-----------+-------+------+------+
|id |name |superiorid|year_joined|dept_id|gender|salary|
+---+-----+----------+-----------+-------+------+------+
|3  |Will |1         |2010       |10     |M     |1000  |
|4  |Jones|2         |2005       |10     |F     |2500  |
+---+-----+----------+-----------+-------+------+------+
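
On this data `subtract()` and `exceptAll()` return the same rows because neither DataFrame contains duplicates. `subtract()` corresponds to SQL `EXCEPT DISTINCT`, while `exceptAll()` corresponds to `EXCEPT ALL` and preserves duplicates. The plain-Python model below (not Spark code) sketches the difference; the function names and the rows with duplicates are hypothetical.

```python
from collections import Counter


def subtract(left, right):
    """Model of EXCEPT DISTINCT: distinct left rows that do not occur in right."""
    right_set = set(right)
    seen, out = set(), []
    for row in left:
        if row not in right_set and row not in seen:
            seen.add(row)
            out.append(row)
    return out


def except_all(left, right):
    """Model of EXCEPT ALL: each right row cancels one matching left row."""
    remaining = Counter(left) - Counter(right)
    return list(remaining.elements())


# Hypothetical rows with duplicates; the tutorial data has none.
left = [("Will", 1000), ("Will", 1000), ("Jones", 2500), ("Smith", 3000), ("Smith", 3000)]
right = [("Smith", 3000)]

print(subtract(left, right))            # [('Will', 1000), ('Jones', 2500)]
print(sorted(except_all(left, right)))  # Will twice, Jones once, Smith once
```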
### Complex dataframes differences

In [38]:
current = [ (1,["Smith", "John"],-1,"2018","10","M",3000), \
            (2,["Rose", "Mary"],1,"2010","20","M",4000), \
            (3,["Williams", "Paul"],1,"2010","10","M",1000), \
            (4,["Jones", "Joe"],2,"2005","10","F",2000), \
            (5,["Brown", "Katie"],2,"2010","40","",-1), \
            (6,["Brown", "Justine"],2,"2010","50","",-1) \
            ]
currentColumns = ["id","full name","superiorid","year_joined", "dept_id","gender","salary"]

currentDF = spark.createDataFrame(data=current, schema = currentColumns)
currentDF.printSchema()
cprint("--- currentDF Dataset", "blue")
currentDF.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- full name: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- superiorid: long (nullable = true)
 |-- year_joined: string (nullable = true)
 |-- dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

--- currentDF Dataset
+---+----------------+----------+-----------+-------+------+------+
|id |full name       |superiorid|year_joined|dept_id|gender|salary|
+---+----------------+----------+-----------+-------+------+------+
|1  |[Smith, John]   |-1        |2018       |10     |M     |3000  |
|2  |[Rose, Mary]    |1         |2010       |20     |M     |4000  |
|3  |[Williams, Paul]|1         |2010       |10     |M     |1000  |
|4  |[Jones, Joe]    |2         |2005       |10     |F     |2000  |
|5  |[Brown, Katie]  |2         |2010       |40     |      |-1    |
|6  |[Brown, Justine]|2         |2010       |50     |      |-1    |
+---+----------------+----------+-----------+-------+------+------+

In [39]:
previous = [ (1,["Smith", "John"],-1,"2018","10","M",3000), \
            (2,["Rose", "Paul"],1,"2010","20","M",4000), \
            (3,["Williams", "Mary", "Fawcett"],1,"2010","10","M",1000), \
            (4,["Jones", "Joe"],2,"2005","10","F",2000), \
            (5,["Brown", "Katie"],2,"2010","40","",-1), \
            (6,["Brown", "Justine"],2,"2010","50","",-1) \
            ]
previousColumns = ["id","full name","superiorid","year_joined", "dept_id","gender","salary"]

previousDF = spark.createDataFrame(data=previous, schema = previousColumns)
cprint("--- previousDF Dataset", "blue")
previousDF.show(truncate=False)

--- previousDF Dataset
+---+-------------------------+----------+-----------+-------+------+------+
|id |full name                |superiorid|year_joined|dept_id|gender|salary|
+---+-------------------------+----------+-----------+-------+------+------+
|1  |[Smith, John]            |-1        |2018       |10     |M     |3000  |
|2  |[Rose, Paul]             |1         |2010       |20     |M     |4000  |
|3  |[Williams, Mary, Fawcett]|1         |2010       |10     |M     |1000  |
|4  |[Jones, Joe]             |2         |2005       |10     |F     |2000  |
|5  |[Brown, Katie]           |2         |2010       |40     |      |-1    |
|6  |[Brown, Justine]         |2         |2010       |50     |      |-1    |
+---+-------------------------+----------+-----------+-------+------+------+



In [40]:
df_test6 = previousDF.subtract(currentDF)
df_test6.show(truncate=False)

+---+-------------------------+----------+-----------+-------+------+------+
|id |full name                |superiorid|year_joined|dept_id|gender|salary|
+---+-------------------------+----------+-----------+-------+------+------+
|2  |[Rose, Paul]             |1         |2010       |20     |M     |4000  |
|3  |[Williams, Mary, Fawcett]|1         |2010       |10     |M     |1000  |
+---+-------------------------+----------+-----------+-------+------+------+
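
The result shows that a row counts as different as soon as its array column differs, whether in an element or in length. A plain-Python model (not Spark code) of this whole-row comparison, assuming arrays compare element by element in order and using tuples to stand in for the `full name` arrays:

```python
# Plain-Python model of row comparison with array columns (not the Spark API).
current = [(2, ("Rose", "Mary")), (3, ("Williams", "Paul")), (4, ("Jones", "Joe"))]
previous = [(2, ("Rose", "Paul")), (3, ("Williams", "Mary", "Fawcett")), (4, ("Jones", "Joe"))]

# Rows of `previous` that have no identical row in `current`.
current_rows = set(current)
diff = [row for row in previous if row not in current_rows]
print(diff)

# Under this model, element order matters as well.
print(("Jones", "Joe") == ("Joe", "Jones"))  # False
```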

