# 5.1 Join in Spark

**Spark DataFrame** supports all the basic SQL join types: **INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS and SELF JOIN**. Spark SQL joins are **wide transformations that shuffle data over the network**, so they can cause serious performance problems when not designed with care.

Thanks to DataFrames and Datasets, Spark SQL **joins come with more optimizations by default**. However, we still need to pay attention when we use joins to avoid performance issues.

In this tutorial, you will learn the different join syntaxes and how to use the different join types on two DataFrames, using Python examples.

We will talk about **Join on Multiple DataFrames** in the next section (5.2 JoinOnMultipleDataFrame).

# 5.2 The Join function



pyspark.sql.DataFrame.join(other, on=None, how=None): joins one DataFrame with another. It has three parameters:

- other: the DataFrame on the right side of the join

- on: the join condition. This parameter is optional. It can be a str, a list of str, a Column, or a list of Column. If it is a string or a list of strings naming columns, the columns must exist in both DataFrames, and Spark performs an equi-join. If the column names differ between the two sides, we can use a Column expression such as "df1.col1 == df2.col2" to match the columns. If multiple columns are involved, we can use a list such as "[df1.age1 == df2.age2, df1.name1 == df2.name2]".

- how: the join type, an optional string that defaults to "inner". The value must be one of: **inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.**


In [1]:
from pyspark.sql import SparkSession,DataFrame
from pyspark.sql.functions import col
import os

In [2]:
local=True

if local:
    spark=SparkSession.builder.master("local[4]").appName("JoinOnTwoDataFrame").getOrCreate()
else:
    spark=SparkSession.builder \
                      .master("k8s://https://kubernetes.default.svc:443") \
                      .appName("JoinOnTwoDataFrame") \
                      .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master") \
                      .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
                      .config("spark.executor.instances", "4") \
                      .config("spark.executor.memory","8g") \
                      .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
                      .getOrCreate()



In [3]:
emp = [(1, "Smith", -1, "2018", "10", "M", 3000),
           (2, "Rose", 1, "2010", "20", "M", 4000),
           (3, "Williams", 1, "2018", "21", "M", 1000),
           (4, "Jones", 2, "2005", "31", "F", 2000),
           (5, "Brown", 2, "2010", "30", "", -1),
           (6, "Foobar", 2, "2010", "150", "", -1)
           ]
emp_col_names = ["emp_id", "name", "superior_emp_id", "dept_creation_year",
                     "emp_dept_id", "gender", "salary"]
emp_df = spark.createDataFrame(data=emp, schema=emp_col_names)
emp_df.printSchema()
emp_df.show(truncate=False)
    

root
 |-- emp_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- superior_emp_id: long (nullable = true)
 |-- dept_creation_year: string (nullable = true)
 |-- emp_dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)


                                                                                

+------+--------+---------------+------------------+-----------+------+------+
|emp_id|name    |superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|
+------+--------+---------------+------------------+-----------+------+------+
|1     |Smith   |-1             |2018              |10         |M     |3000  |
|2     |Rose    |1              |2010              |20         |M     |4000  |
|3     |Williams|1              |2018              |21         |M     |1000  |
|4     |Jones   |2              |2005              |31         |F     |2000  |
|5     |Brown   |2              |2010              |30         |      |-1    |
|6     |Foobar  |2              |2010              |150        |      |-1    |
+------+--------+---------------+------------------+-----------+------+------+


In [4]:
dept = [("Finance", 10, "2018"),
            ("Marketing_US", 20, "2010"),
            ("Marketing_FR", 21, "2018"),
            ("Sales_US", 30, "2005"),
            ("Sales_FR", 31, "2010"),
            ("IT", 50, "2005")
            ]

dept_col_name = ["dept_name", "dept_id", "dept_creation_year"]
dept_df = spark.createDataFrame(data=dept, schema=dept_col_name)
dept_df.printSchema()
dept_df.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)
 |-- dept_creation_year: string (nullable = true)
+------------+-------+------------------+
|dept_name   |dept_id|dept_creation_year|
+------------+-------+------------------+
|Finance     |10     |2018              |
|Marketing_US|20     |2010              |
|Marketing_FR|21     |2018              |
|Sales_US    |30     |2005              |
|Sales_FR    |31     |2010              |
|IT          |50     |2005              |
+------------+-------+------------------+


# 5.3 Join examples

We will use examples to illustrate the different join types and their syntax in Python.

## 5.3.1 Inner join with different join column names
In this example, we inner join two DataFrames on a column. Note that the joining column has a different name in each DataFrame, so there is no ambiguity.

Note that after the join, the joining columns of both DataFrames are in the result DataFrame. One of them can be dropped to save space if your dataset is large.

In [5]:
emp_df.join(dept_df, emp_df.emp_dept_id ==dept_df.dept_id).show()



+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
|     5|   Brown|              2|              2010|         30|      |    -1|    Sales_US|     30|              2005|
|     4|   Jones|              2|              2005|         31|     F|  2000|    Sales_FR|     31|              2010|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+

                                                                                

## 5.3.2 Inner join with the same join column name

In this example, we inner join two DataFrames on a column that has the same name ("dept_id") in both DataFrames.

Try the bad example first: the result has two columns called "dept_id". If you try to select the "dept_id" column, you will get the error message "Reference 'dept_id' is ambiguous".



In [6]:
# change the column name from "emp_dept_id" to "dept_id"
emp_df_dup=emp_df.withColumnRenamed("emp_dept_id","dept_id")
emp_df_dup.show()

+------+--------+---------------+------------------+-------+------+------+
|emp_id|    name|superior_emp_id|dept_creation_year|dept_id|gender|salary|
+------+--------+---------------+------------------+-------+------+------+
|     1|   Smith|             -1|              2018|     10|     M|  3000|
|     2|    Rose|              1|              2010|     20|     M|  4000|
|     3|Williams|              1|              2018|     21|     M|  1000|
|     4|   Jones|              2|              2005|     31|     F|  2000|
|     5|   Brown|              2|              2010|     30|      |    -1|
|     6|  Foobar|              2|              2010|    150|      |    -1|
+------+--------+---------------+------------------+-------+------+------+


In [7]:
# note that we have two columns called "dept_id" in the result dataframe
bad_df=emp_df_dup.join(dept_df,emp_df_dup.dept_id==dept_df.dept_id,"inner")
bad_df.show()

+------+--------+---------------+------------------+-------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-------+------+------+------------+-------+------------------+
|     1|   Smith|             -1|              2018|     10|     M|  3000|     Finance|     10|              2018|
|     2|    Rose|              1|              2010|     20|     M|  4000|Marketing_US|     20|              2010|
|     3|Williams|              1|              2018|     21|     M|  1000|Marketing_FR|     21|              2018|
|     5|   Brown|              2|              2010|     30|      |    -1|    Sales_US|     30|              2005|
|     4|   Jones|              2|              2005|     31|     F|  2000|    Sales_FR|     31|              2010|
+------+--------+---------------+------------------+-------+------+------+------------+-------+------------------+

In [13]:
# if we select the dept_id column, we get an error
bad_df.select("dept_id").show()

AnalysisException: Reference 'dept_id' is ambiguous, could be: dept_id, dept_id.

**To avoid the above error**, we can ask Spark to keep only one copy of the join column after the inner join. The simplest solution is to pass the column name shared by the two DataFrames directly as the join condition.

In [16]:
# note, we only have one dept_id column in the result dataframe
good_df=emp_df_dup.join(dept_df,"dept_id","inner")
good_df.show()

+-------+------+--------+---------------+------------------+------+------+------------+------------------+
|dept_id|emp_id|    name|superior_emp_id|dept_creation_year|gender|salary|   dept_name|dept_creation_year|
+-------+------+--------+---------------+------------------+------+------+------------+------------------+
|     31|     4|   Jones|              2|              2005|     F|  2000|    Sales_FR|              2010|
|     10|     1|   Smith|             -1|              2018|     M|  3000|     Finance|              2018|
|     21|     3|Williams|              1|              2018|     M|  1000|Marketing_FR|              2018|
|     30|     5|   Brown|              2|              2010|      |    -1|    Sales_US|              2005|
|     20|     2|    Rose|              1|              2010|     M|  4000|Marketing_US|              2010|
+-------+------+--------+---------------+------------------+------+------+------------+------------------+


In [17]:
good_df.select("dept_id").show()

+-------+
|dept_id|
+-------+
|     31|
|     10|
|     21|
|     30|
|     20|
+-------+


## 5.3.3 Inner join on multiple columns with the same column names

In this example, we inner join two DataFrames on multiple columns, "dept_id" and "dept_creation_year". I intentionally introduced three errors in emp_df: in the last three rows, the year or the dept_id does not match any row in dept_df. So the join returns only 3 rows.

Important note: **in the condition list, conditions separated by "," are combined with a logical "and". To express "and" explicitly, use "&" instead of ",". To express "or", use "|".**
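To make the difference between ",", "&" and "|" concrete, here is a plain-Python model (no Spark involved). Each condition is a function of a left row and a right row, and a list of conditions is folded together with "and", which is the behaviour described above. The tuples are a cut-down copy of emp_df and dept_df with emp_dept_id stored as an int, and the helper names (all_of, any_of, inner_join) are hypothetical.

```python
from functools import reduce
from itertools import product

# (emp_id, emp_dept_id, dept_creation_year): cut-down copy of emp_df
emp = [(1, 10, "2018"), (2, 20, "2010"), (3, 21, "2018"),
       (4, 31, "2005"), (5, 30, "2010"), (6, 150, "2010")]
# (dept_name, dept_id, dept_creation_year): copy of dept_df
dept = [("Finance", 10, "2018"), ("Marketing_US", 20, "2010"),
        ("Marketing_FR", 21, "2018"), ("Sales_US", 30, "2005"),
        ("Sales_FR", 31, "2010"), ("IT", 50, "2005")]


def same_dept(e, d):
    return e[1] == d[1]


def same_year(e, d):
    return e[2] == d[2]


def all_of(conds):
    """Fold a list of conditions with 'and' (models the comma-separated list)."""
    return reduce(lambda f, g: lambda e, d: f(e, d) and g(e, d), conds)


def any_of(conds):
    """Fold a list of conditions with 'or' (models chaining with |)."""
    return reduce(lambda f, g: lambda e, d: f(e, d) or g(e, d), conds)


def inner_join(left, right, cond):
    """Keep every (left, right) pair for which the condition holds."""
    return [(e, d) for e, d in product(left, right) if cond(e, d)]


and_rows = inner_join(emp, dept, all_of([same_dept, same_year]))
or_rows = inner_join(emp, dept, any_of([same_dept, same_year]))

print(len(and_rows))  # 3
print(len(or_rows))   # 14
```

The "and" version keeps only the three employees whose department and year both match, while the "or" version also pairs rows that merely share a year, which is why it returns many more rows.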

In [18]:
# we use an implicit and; note that the result has duplicated column names
emp_df.join(dept_df,[emp_df.emp_dept_id == dept_df.dept_id,emp_df.dept_creation_year == dept_df.dept_creation_year], "inner").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+


In [20]:
# we use an explicit and
# note that each condition must be wrapped in () to be combined with &
emp_df.join(dept_df,[(emp_df.emp_dept_id == dept_df.dept_id) & (emp_df.dept_creation_year == dept_df.dept_creation_year)], "inner").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+


In [22]:
# we use an explicit or
emp_df.join(dept_df,[(emp_df.emp_dept_id == dept_df.dept_id) | (emp_df.dept_creation_year == dept_df.dept_creation_year)], "inner").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
|     1|   Smith|             -1|              2018|         10|     M|  3000|Marketing_FR|     21|              2018|
|     3|Williams|              1|              2018|         21|     M|  1000|     Finance|     10|              2018|
|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
|     2|    Rose|              1|              2010|         20|     M|  4000|    Sales_FR|     31|              2010|
...

In [24]:
# this cond does not work: we can't mix Column expressions and column-name strings in one condition list
cond_bad = [emp_df.emp_dept_id == dept_df.dept_id, "dept_creation_year"]
emp_df_dup.join(dept_df, cond_bad, "inner").show(truncate=False)    

Py4JError: An error occurred while calling o135.and. Trace:
py4j.Py4JException: Method and([class java.lang.String]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



In [25]:
cond_good = ["dept_id", "dept_creation_year"]
# as the two dataframes share the column names "dept_id" and "dept_creation_year", we can use cond_good.
emp_df_dup.join(dept_df, cond_good, "inner").show(truncate=False)

+-------+------------------+------+--------+---------------+------+------+------------+
|dept_id|dept_creation_year|emp_id|name    |superior_emp_id|gender|salary|dept_name   |
+-------+------------------+------+--------+---------------+------+------+------------+
|20     |2010              |2     |Rose    |1              |M     |4000  |Marketing_US|
|21     |2018              |3     |Williams|1              |M     |1000  |Marketing_FR|
|10     |2018              |1     |Smith   |-1             |M     |3000  |Finance     |
+-------+------------------+------+--------+---------------+------+------+------------+


## 5.3.4 Outer join (a.k.a. full, fullouter join)
The **outer join returns all rows from both datasets; where the join expression does not match, it returns null for the columns of the side that has no match**.

Note: the output df has two rows that contain null:
- row 1: dept_id = 50, which has no matching emp_dept_id
- row 2: emp_dept_id = 150, which has no matching dept_id
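The null-filling behaviour can be modelled in a few lines of plain Python (this is not Spark code). In this sketch, None plays the role of null, the tuples are a cut-down copy of emp_df and dept_df with emp_dept_id stored as an int, and full_outer_join is a hypothetical helper written only for this illustration.

```python
# (emp_id, name, emp_dept_id): cut-down copy of emp_df
emp = [(1, "Smith", 10), (2, "Rose", 20), (3, "Williams", 21),
       (4, "Jones", 31), (5, "Brown", 30), (6, "Foobar", 150)]
# (dept_name, dept_id): cut-down copy of dept_df
dept = [("Finance", 10), ("Marketing_US", 20), ("Marketing_FR", 21),
        ("Sales_US", 30), ("Sales_FR", 31), ("IT", 50)]


def full_outer_join(left, right, left_key, right_key):
    """Model of a full outer equi-join; None stands in for null."""
    left_nulls = (None,) * len(left[0])
    right_nulls = (None,) * len(right[0])
    rows, matched_right = [], set()
    for lrow in left:
        matches = [i for i, rrow in enumerate(right)
                   if lrow[left_key] == rrow[right_key]]
        if matches:
            rows.extend(lrow + right[i] for i in matches)
            matched_right.update(matches)
        else:
            # left row without a partner: pad the right-side columns
            rows.append(lrow + right_nulls)
    # right rows that no left row matched: pad the left-side columns
    rows.extend(left_nulls + rrow
                for i, rrow in enumerate(right) if i not in matched_right)
    return rows


result = full_outer_join(emp, dept, left_key=2, right_key=1)
for row in result:
    print(row)
```

Dropping the second padding step gives the left outer join of section 5.3.5, and dropping the first one gives the right outer join of section 5.3.6.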

In [26]:
emp_df.join(dept_df,emp_df.emp_dept_id==dept_df.dept_id, "outer").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|  null|    null|           null|              null|       null|  null|  null|          IT|     50|              2005|
|     4|   Jones|              2|              2005|         31|     F|  2000|    Sales_FR|     31|              2010|
|     6|  Foobar|              2|              2010|        150|      |    -1|        null|   null|              null|
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
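|     5|   Brown|              2|              2010|         30|      |    -1|    Sales_US|     30|              2005|
|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+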

In [27]:
emp_df.join(dept_df,emp_df.emp_dept_id==dept_df.dept_id, "full").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|  null|    null|           null|              null|       null|  null|  null|          IT|     50|              2005|
|     4|   Jones|              2|              2005|         31|     F|  2000|    Sales_FR|     31|              2010|
|     6|  Foobar|              2|              2010|        150|      |    -1|        null|   null|              null|
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
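|     5|   Brown|              2|              2010|         30|      |    -1|    Sales_US|     30|              2005|
|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+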

In [28]:
emp_df.join(dept_df,emp_df.emp_dept_id==dept_df.dept_id, "fullouter").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|  null|    null|           null|              null|       null|  null|  null|          IT|     50|              2005|
|     4|   Jones|              2|              2005|         31|     F|  2000|    Sales_FR|     31|              2010|
|     6|  Foobar|              2|              2010|        150|      |    -1|        null|   null|              null|
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
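|     5|   Brown|              2|              2010|         30|      |    -1|    Sales_US|     30|              2005|
|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+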

You can see that the above three join types **(i.e. outer, full, fullouter) return the same result**.

## 5.3.5 Left outer join
**Left (a.k.a. leftouter) join returns all rows from the left dataset; where the join expression does not match, it assigns null to the right-side columns.**

Note: the output df has one row that contains null:
- row 1: emp_dept_id = 150, which has no matching dept_id. In emp_df.join(dept_df, ...), emp_df is the left-side df and dept_df is the right-side df. All rows of the left df are kept; for those without a match in the right df, the right df columns are filled with null.

Rows of the right df that have no match in the left df are dropped (e.g. the row with dept_id = 50).

In [29]:
emp_df.join(dept_df, emp_df.emp_dept_id == dept_df.dept_id, "left").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|     4|   Jones|              2|              2005|         31|     F|  2000|    Sales_FR|     31|              2010|
|     6|  Foobar|              2|              2010|        150|      |    -1|        null|   null|              null|
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
|     5|   Brown|              2|              2010|         30|      |    -1|    Sales_US|     30|              2005|
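|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+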

In [30]:
emp_df.join(dept_df, emp_df.emp_dept_id == dept_df.dept_id, "leftouter").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|     4|   Jones|              2|              2005|         31|     F|  2000|    Sales_FR|     31|              2010|
|     6|  Foobar|              2|              2010|        150|      |    -1|        null|   null|              null|
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
|     5|   Brown|              2|              2010|         30|      |    -1|    Sales_US|     30|              2005|
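|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+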

You can see that the above **two join types (i.e. left, leftouter) return the same result**.

## 5.3.6 Right outer join 

**Right (a.k.a. rightouter) join returns all rows from the right dataset; where the join expression does not match, it assigns null to the left-side columns. Rows of the left df that have no match in the right df are dropped** (e.g. the row with emp_dept_id = 150).

Note: the output df has one row that contains null:
- row 1: dept_id = 50, which has no matching emp_dept_id. In emp_df.join(dept_df, ...), emp_df is the left-side df and dept_df is the right-side df.

All rows of the right df (i.e. dept_df) are kept; for those without a match in the left df, the left df columns are filled with null.


In [31]:
emp_df.join(dept_df, emp_df.emp_dept_id == dept_df.dept_id, "right").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|  null|    null|           null|              null|       null|  null|  null|          IT|     50|              2005|
|     4|   Jones|              2|              2005|         31|     F|  2000|    Sales_FR|     31|              2010|
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
|     5|   Brown|              2|              2010|         30|      |    -1|    Sales_US|     30|              2005|
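|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+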

In [32]:
emp_df.join(dept_df, emp_df.emp_dept_id == dept_df.dept_id, "rightouter").show()

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|   dept_name|dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|  null|    null|           null|              null|       null|  null|  null|          IT|     50|              2005|
|     4|   Jones|              2|              2005|         31|     F|  2000|    Sales_FR|     31|              2010|
|     1|   Smith|             -1|              2018|         10|     M|  3000|     Finance|     10|              2018|
|     3|Williams|              1|              2018|         21|     M|  1000|Marketing_FR|     21|              2018|
|     5|   Brown|              2|              2010|         30|      |    -1|    Sales_US|     30|              2005|
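|     2|    Rose|              1|              2010|         20|     M|  4000|Marketing_US|     20|              2010|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+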

You can see that the above **two join types (i.e. right, rightouter) return the same result**.

## 5.3.7 Left Semi Join

**A leftsemi join is similar to an inner join followed by a select of the left-hand df columns, except that each left row is returned at most once, even if it matches several right rows. As a result, all columns from the right dataset are ignored.**

Note that in the result of the example below, we only have the left-hand columns, and the row where emp_dept_id=150 is dropped because it has no match on the right side.
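One detail is easier to see in code: a semi join acts as a filter on the left side, so a left row is returned once no matter how many right rows it matches. The plain-Python sketch below (not Spark code) contrasts this with an inner join. The duplicated department id and the "Finance_bis" row are hypothetical data added only to trigger the difference; they are not part of dept_df.

```python
# (emp_id, name, emp_dept_id): three rows taken from emp_df
emp = [(1, "Smith", 10), (2, "Rose", 20), (6, "Foobar", 150)]
# (dept_name, dept_id): dept_id 10 appears twice on purpose (hypothetical data)
dept = [("Finance", 10), ("Finance_bis", 10), ("Marketing_US", 20)]


def inner_join(left, right):
    """One output row per matching (left, right) pair."""
    return [lrow + rrow for lrow in left for rrow in right if lrow[2] == rrow[1]]


def left_semi_join(left, right):
    """Keep a left row if at least one right row matches; left columns only."""
    right_keys = {rrow[1] for rrow in right}
    return [lrow for lrow in left if lrow[2] in right_keys]


inner_rows = inner_join(emp, dept)
semi_rows = left_semi_join(emp, dept)

print(len(inner_rows))  # 3: Smith appears twice, Rose once
print(semi_rows)        # [(1, 'Smith', 10), (2, 'Rose', 20)]
```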

In [33]:
emp_df.join(dept_df, emp_df.emp_dept_id == dept_df.dept_id, "leftsemi").show()

+------+--------+---------------+------------------+-----------+------+------+
|emp_id|    name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|
+------+--------+---------------+------------------+-----------+------+------+
|     4|   Jones|              2|              2005|         31|     F|  2000|
|     1|   Smith|             -1|              2018|         10|     M|  3000|
|     3|Williams|              1|              2018|         21|     M|  1000|
|     5|   Brown|              2|              2010|         30|      |    -1|
|     2|    Rose|              1|              2010|         20|     M|  4000|
+------+--------+---------------+------------------+-----------+------+------+


## 5.3.8 Left Anti Join
A leftanti join does the opposite of a leftsemi join: it returns only the rows of the left-hand df that have no match in the right-hand df. As with leftsemi, only the left-hand columns are returned; all columns from the right dataset are ignored.

Note that in the result of the example below, we only have the left-hand columns. Only one row is returned, the one where emp_dept_id=150, because it is the only row with no match on the right side.
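Another way to read the anti join is as a left outer join that keeps only the rows whose right-side key came back null. The plain-Python sketch below (not Spark code) checks that both readings give the same rows on a cut-down copy of emp_df and dept_df; None stands in for null, emp_dept_id is stored as an int, and the helper names are hypothetical.

```python
# (emp_id, name, emp_dept_id): cut-down copy of emp_df
emp = [(1, "Smith", 10), (2, "Rose", 20), (3, "Williams", 21),
       (4, "Jones", 31), (5, "Brown", 30), (6, "Foobar", 150)]
# (dept_name, dept_id): cut-down copy of dept_df
dept = [("Finance", 10), ("Marketing_US", 20), ("Marketing_FR", 21),
        ("Sales_US", 30), ("Sales_FR", 31), ("IT", 50)]


def left_anti_join(left, right):
    """Keep the left rows that have no matching right row."""
    right_keys = {rrow[1] for rrow in right}
    return [lrow for lrow in left if lrow[2] not in right_keys]


def left_outer_join(left, right):
    """Keep every left row; pad with None when nothing matches."""
    rows = []
    for lrow in left:
        matches = [rrow for rrow in right if lrow[2] == rrow[1]]
        rows.extend(lrow + rrow for rrow in matches or [(None, None)])
    return rows


anti_rows = left_anti_join(emp, dept)
# Same result via a left outer join: keep rows whose dept_id (index 4) is None,
# then cut the row back to the three left-side columns.
via_left_join = [row[:3] for row in left_outer_join(emp, dept) if row[4] is None]

print(anti_rows)  # [(6, 'Foobar', 150)]
```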

In [34]:
emp_df.join(dept_df, emp_df.emp_dept_id == dept_df.dept_id, "leftanti").show()

+------+------+---------------+------------------+-----------+------+------+
|emp_id|  name|superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|
+------+------+---------------+------------------+-----------+------+------+
|     6|Foobar|              2|              2010|        150|      |    -1|
+------+------+---------------+------------------+-----------+------+------+


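## 5.3.9 Self join
**All of the above join types can be applied to a single dataframe joined with itself.**

The example below shows an inner join of emp_df with itself, using the join condition superior_emp_id = emp_id to produce a table that pairs each employee with the name of their superior. Smith has superior_emp_id = -1, which matches no emp_id, so he does not appear in the result.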
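In plain Python, the same lookup amounts to indexing the employees by emp_id and then following each superior_emp_id into that index. The sketch below is not Spark code; it uses a cut-down copy of emp_df, and name_by_id and pairs are names chosen for this illustration.

```python
# (emp_id, name, superior_emp_id): cut-down copy of emp_df
emp = [(1, "Smith", -1), (2, "Rose", 1), (3, "Williams", 1),
       (4, "Jones", 2), (5, "Brown", 2), (6, "Foobar", 2)]

# Index the "right side" of the self join by its key
name_by_id = {emp_id: name for emp_id, name, _ in emp}

# Inner join semantics: employees whose superior id is unknown are dropped
pairs = [(emp_id, name, sup_id, name_by_id[sup_id])
         for emp_id, name, sup_id in emp
         if sup_id in name_by_id]

for row in pairs:
    print(row)
```

Because the list and the index are built from the same data, this mirrors why the Spark version needs two aliases ("emp1" and "emp2") to tell the two roles of emp_df apart.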

In [39]:
emp_df.alias("emp1").join(emp_df.alias("emp2"), col("emp1.superior_emp_id") == col("emp2.emp_id"), "inner") \
        .select(col("emp1.emp_id").alias("emp_id"), col("emp1.name").alias("name"),
                col("emp2.emp_id").alias("superior_emp_id"), col("emp2.name").alias("superior_name")) \
        .show(truncate=False)

+------+--------+---------------+-------------+
|emp_id|name    |superior_emp_id|superior_name|
+------+--------+---------------+-------------+
|2     |Rose    |1              |Smith        |
|3     |Williams|1              |Smith        |
|4     |Jones   |2              |Rose         |
|5     |Brown   |2              |Rose         |
|6     |Foobar  |2              |Rose         |
+------+--------+---------------+-------------+


# 5.4 Pure SQL
We can also use pure SQL to do joins.


In [42]:
# create views based on the dataframe
emp_df.createOrReplaceTempView("EMP")
dept_df.createOrReplaceTempView("DEPT")

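# implicit join: list both views in FROM and put the join condition in WHERE
# show() returns None, so keep the DataFrame in joinDF and call show() separately
joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
joinDF.show(truncate=False)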

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|name    |superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|dept_name   |dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|4     |Jones   |2              |2005              |31         |F     |2000  |Sales_FR    |31     |2010              |
|1     |Smith   |-1             |2018              |10         |M     |3000  |Finance     |10     |2018              |
|3     |Williams|1              |2018              |21         |M     |1000  |Marketing_FR|21     |2018              |
|5     |Brown   |2              |2010              |30         |      |-1    |Sales_US    |30     |2005              |
|2     |Rose    |1              |2010              |20         |M     |4000  |Marketing_US|20     |2010              |
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+

In [41]:
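# explicit inner join on the department id columns
# show() returns None, so keep the DataFrame in joinDF2 and call show() separately
joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF2.show(truncate=False)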

+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|emp_id|name    |superior_emp_id|dept_creation_year|emp_dept_id|gender|salary|dept_name   |dept_id|dept_creation_year|
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+
|4     |Jones   |2              |2005              |31         |F     |2000  |Sales_FR    |31     |2010              |
|1     |Smith   |-1             |2018              |10         |M     |3000  |Finance     |10     |2018              |
|3     |Williams|1              |2018              |21         |M     |1000  |Marketing_FR|21     |2018              |
|5     |Brown   |2              |2010              |30         |      |-1    |Sales_US    |30     |2005              |
|2     |Rose    |1              |2010              |20         |M     |4000  |Marketing_US|20     |2010              |
+------+--------+---------------+------------------+-----------+------+------+------------+-------+------------------+