## Left or Right Outer Join

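Let us understand how left and right outer joins work in Spark.
* A left or right outer join returns the rows from both data sets that satisfy the join condition, along with the rows from the driving table that do not satisfy it.
* The driving table is the left data set for a left outer join and the right data set for a right outer join. For the unmatched rows, the columns from the other side are `null`.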
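
The behaviour described above can be sketched in plain Python, without Spark. This is only an illustrative model: the helper name `left_outer_join` and the sample rows are hypothetical, and Spark's actual join execution is different.

```python
# Plain-Python model of a left outer join; it does not use Spark.
# The helper name and the sample rows are hypothetical.

def left_outer_join(left, right, left_key, right_key):
    """Pair every left row with each matching right row,
    or with None when no right row matches."""
    # Index the right side by its join key for quick lookups.
    right_index = {}
    for row in right:
        right_index.setdefault(row[right_key], []).append(row)

    joined = []
    for left_row in left:
        matches = right_index.get(left_row[left_key])
        if matches:
            # One output row per matching right row.
            joined.extend((left_row, right_row) for right_row in matches)
        else:
            # Unmatched driving-table row: the right side is None.
            joined.append((left_row, None))
    return joined


orders_sample = [
    {"order_id": 1, "order_status": "CLOSED"},
    {"order_id": 2, "order_status": "PENDING_PAYMENT"},
    {"order_id": 3, "order_status": "COMPLETE"},
]
order_items_sample = [
    {"order_item_id": 10, "order_item_order_id": 1},
    {"order_item_id": 11, "order_item_order_id": 1},
    {"order_item_id": 12, "order_item_order_id": 3},
]

result = left_outer_join(
    orders_sample, order_items_sample, "order_id", "order_item_order_id"
)
for order, item in result:
    print(order["order_id"], item["order_item_id"] if item else None)
```

In this sample, order 2 has no order items, so it still appears in the result with `None` on the order items side.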

Let us start the Spark session for this notebook so that we can execute the code provided.

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Joining Data Sets'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

In [2]:
spark.conf.set("spark.sql.shuffle.partitions", "2")

In [3]:
orders = spark.read.json('/public/retail_db_json/orders')

In [4]:
order_items = spark.read.json('/public/retail_db_json/order_items')

In [5]:
orders.printSchema()

root
 |-- order_customer_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_id: long (nullable = true)
 |-- order_status: string (nullable = true)



In [6]:
order_items.printSchema()

root
 |-- order_item_id: long (nullable = true)
 |-- order_item_order_id: long (nullable = true)
 |-- order_item_product_id: long (nullable = true)
 |-- order_item_product_price: double (nullable = true)
 |-- order_item_quantity: long (nullable = true)
 |-- order_item_subtotal: double (nullable = true)



In [7]:
help(orders.join)

Help on method join in module pyspark.sql.dataframe:

join(other, on=None, how=None) method of pyspark.sql.dataframe.DataFrame instance
    Joins with another :class:`DataFrame`, using the given join expression.
    
    :param other: Right side of the join
    :param on: a string for the join column name, a list of column names,
        a join expression (Column), or a list of Columns.
        If `on` is a string or a list of strings indicating the name of the join column(s),
        the column(s) must exist on both sides, and this performs an equi-join.
    :param how: str, default ``inner``. Must be one of: ``inner``, ``cross``, ``outer``,
        ``full``, ``full_outer``, ``left``, ``left_outer``, ``right``, ``right_outer``,
        ``left_semi``, and ``left_anti``.
    
    The following performs a full outer join between ``df1`` and ``df2``.
    
    >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
    [Row(name=None, height=80), Row(name='Bob'

In [8]:
orders_join = orders.join(
    order_items, 
    on=orders['order_id'] == order_items['order_item_order_id'],
    how='inner'
)

In [9]:
orders_join.count()

172198

In [10]:
orders_outer = orders.join(
    order_items, 
    on=orders['order_id'] == order_items['order_item_order_id'],
    how='left'
)

In [11]:
orders_outer.printSchema()

root
 |-- order_customer_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_id: long (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_item_id: long (nullable = true)
 |-- order_item_order_id: long (nullable = true)
 |-- order_item_product_id: long (nullable = true)
 |-- order_item_product_price: double (nullable = true)
 |-- order_item_quantity: long (nullable = true)
 |-- order_item_subtotal: double (nullable = true)



In [12]:
orders_outer.show()

+-----------------+--------------------+--------+---------------+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|order_customer_id|          order_date|order_id|   order_status|order_item_id|order_item_order_id|order_item_product_id|order_item_product_price|order_item_quantity|order_item_subtotal|
+-----------------+--------------------+--------+---------------+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|            2|                  2|                 1073|                  199.99|                  1|             199.99|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|            3|                  2|                  502|                    50.0|                  5|              250.0|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|

In [13]:
orders.count()

68883

In [14]:
order_items.count()

172198

In [15]:
orders_join.count()

172198

In [16]:
orders_outer.count()

183650
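
The four counts above can be reconciled with simple arithmetic. The sketch below only reuses the numbers printed by the cells. It assumes that `order_id` is unique in `orders` and that every order item refers to an existing order, which the matching inner join and `order_items` counts suggest but do not prove.

```python
# Counts copied from the cell outputs above.
orders_count = 68883
order_items_count = 172198
inner_join_count = 172198
outer_join_count = 183650

# Rows that exist only because the outer join keeps unmatched orders.
unmatched_orders = outer_join_count - inner_join_count

# Assumption: each unmatched order contributes exactly one output row,
# so the remaining orders are the ones that have at least one order item.
orders_with_items = orders_count - unmatched_orders

print(unmatched_orders)   # 11452
print(orders_with_items)  # 57431
```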

In [17]:
orders. \
    join(
        order_items, 
        on=orders['order_id'] == order_items['order_item_order_id'],
        how='left'
    ). \
    select(orders['*'], order_items['order_item_id']). \
    show()

+-----------------+--------------------+--------+---------------+-------------+
|order_customer_id|          order_date|order_id|   order_status|order_item_id|
+-----------------+--------------------+--------+---------------+-------------+
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|            2|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|            3|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|            4|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|            5|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|            6|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|            7|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|            8|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|            9|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|           10|
|            11318|2013-07-25 00:00:...|

In [21]:
orders. \
    join(
        order_items, 
        on=orders['order_id'] == order_items['order_item_order_id'],
        how='left'
    ). \
    filter(order_items['order_item_id'].isNull()). \
    select(orders['*'], order_items['order_item_id']). \
    show()

+-----------------+--------------------+--------+---------------+-------------+
|order_customer_id|          order_date|order_id|   order_status|order_item_id|
+-----------------+--------------------+--------+---------------+-------------+
|             7562|2013-07-25 00:00:...|      26|       COMPLETE|         null|
|            10628|2013-07-25 00:00:...|      54|PENDING_PAYMENT|         null|
|             2052|2013-07-25 00:00:...|      55|        PENDING|         null|
|             8365|2013-07-25 00:00:...|      60|PENDING_PAYMENT|         null|
|             7327|2013-07-25 00:00:...|      79|PENDING_PAYMENT|         null|
|             3566|2013-07-25 00:00:...|      82|PENDING_PAYMENT|         null|
|              824|2013-07-25 00:00:...|      89|        ON_HOLD|         null|
|             4611|2013-07-26 00:00:...|     125|PENDING_PAYMENT|         null|
|             2772|2013-07-26 00:00:...|     128|PENDING_PAYMENT|         null|
|             8876|2013-07-26 00:00:...|

In [22]:
orders. \
    join(
        order_items, 
        on=orders['order_id'] == order_items['order_item_order_id'],
        how='left'
    ). \
    filter(order_items['order_item_id'].isNull()). \
    count()

11452
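
Filtering the joined data on a null `order_item_id` keeps only the orders that found no matching order item. The same idea can be sketched in plain Python as a membership check on the join keys. The key lists below are hypothetical sample values, not taken from the retail data set.

```python
# Hypothetical join keys from each side.
order_ids = [1, 2, 3, 4]
order_item_order_ids = [1, 1, 3]

# Keys that have at least one order item.
matched_keys = set(order_item_order_ids)

# Orders whose key never appears on the order items side.
# These are the rows for which the outer join fills order item columns with null.
orders_without_items = [oid for oid in order_ids if oid not in matched_keys]

print(orders_without_items)  # [2, 4]
```

The `how` values listed in the help output include `left_anti`, which returns only the left-side rows that have no match, so it may be an alternative to the outer join followed by a null filter when the order item columns are not needed.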

In [25]:
orders. \
    alias('o'). \
    join(
        order_items.alias('oi'), 
        on=orders['order_id'] == order_items['order_item_order_id'],
        how='left'
    ). \
    filter('oi.order_item_id IS NULL'). \
    count()

11452

In [23]:
orders. \
    join(
        order_items, 
        on=orders['order_id'] == order_items['order_item_order_id'],
        how='left'
    ). \
    filter(order_items['order_item_id'].isNotNull()). \
    count()

172198
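
The examples above keep `orders` as the driving table. A right outer join mirrors this: it keeps every row of the right data set instead. The plain-Python sketch below models that relationship on join keys only. The helper names and the sample keys (including the orphan key `9`) are hypothetical, and it does not use Spark.

```python
def left_outer_pairs(left_keys, right_keys):
    """Pair each left key with every equal right key, or with None if none match."""
    pairs = []
    for left_key in left_keys:
        matches = [right_key for right_key in right_keys if right_key == left_key]
        if matches:
            pairs.extend((left_key, right_key) for right_key in matches)
        else:
            pairs.append((left_key, None))
    return pairs


def right_outer_pairs(left_keys, right_keys):
    """Right outer join modelled as a left outer join with the sides swapped."""
    swapped = left_outer_pairs(right_keys, left_keys)
    # Swap each pair back so the left key comes first again.
    return [(left_key, right_key) for right_key, left_key in swapped]


order_ids = [1, 2, 3]
item_order_ids = [1, 1, 3, 9]  # 9 is a hypothetical item without an order

left_result = left_outer_pairs(order_ids, item_order_ids)
right_result = right_outer_pairs(order_ids, item_order_ids)

print(left_result)   # [(1, 1), (1, 1), (2, None), (3, 3)]
print(right_result)  # [(1, 1), (1, 1), (3, 3), (None, 9)]
```

With the notebook's data frames, the corresponding call would presumably pass `how='right'`, one of the values listed in the help output. Since the inner join count equals the `order_items` count, such a join with `orders` on the left would likely return no extra rows here.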