## Filtering Data

Let us understand how we can filter the data in Spark SQL.

Let us start the Spark session for this Notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Basic Transformations").
    master("yarn").
    getOrCreate

username = itv002480
spark = org.apache.spark.sql.SparkSession@271493db



If you prefer to work from the command line, you can run Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* We use the `WHERE` clause to filter the data.
* All comparison operators such as `=`, `!=`, `>`, `<`, etc. can be used to compare a column, expression, or literal with another column, expression, or literal.
* We can use `LIKE` with the `%` wildcard, or regular-expression operators such as `RLIKE` (`regexp_like` in recent Spark versions), for pattern matching.
* Boolean `OR` and `AND` can be used when we want to apply multiple conditions.
  * Get all orders with `order_status` equal to `COMPLETE` or `CLOSED`. We can also use the `IN` operator.
  * Get all orders placed in January 2014 with `order_status` equal to `COMPLETE` or `CLOSED`.
* We need to use `IS NULL` and `IS NOT NULL` to compare against null values.
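
The points above can be tried on a tiny table without a cluster. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, the sample rows are made up for the example, and `order_ids` is a hypothetical helper. The `WHERE` predicates are the same ones used in the Spark SQL queries in this section. SQLite can differ from Spark SQL in other details (for example, how `LIKE` treats letter case), so treat it as a way to reason about the predicates, not as a replacement for running them in Spark.

```python
import sqlite3

# Assumption: SQLite is only a stand-in so the sketch runs without a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER,
        order_date TEXT,
        order_customer_id INTEGER,
        order_status TEXT
    )
""")

# Illustrative sample rows (made up for this sketch, not the retail data set).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "2013-07-25 00:00:00.0", 11599, "CLOSED"),
        (2, "2013-07-25 00:00:00.0", 256, "PENDING_PAYMENT"),
        (3, "2013-07-25 00:00:00.0", 12111, "COMPLETE"),
        (25891, "2014-01-01 00:00:00.0", 3037, "CLOSED"),
        (61898, "2014-01-01 00:00:00.0", 10152, "COMPLETE"),
    ],
)


def order_ids(where_clause):
    """Hypothetical helper: sorted order_ids matching the WHERE clause."""
    sql = f"SELECT order_id FROM orders WHERE {where_clause} ORDER BY order_id"
    return [row[0] for row in conn.execute(sql)]


# IN and a chain of OR conditions on the same column select the same rows.
with_in = order_ids("order_status IN ('COMPLETE', 'CLOSED')")
with_or = order_ids("order_status = 'COMPLETE' OR order_status = 'CLOSED'")

# AND narrows the result further: only January 2014 orders remain.
jan_2014 = order_ids(
    "order_status IN ('COMPLETE', 'CLOSED') AND order_date LIKE '2014-01%'"
)

print(with_in)
print(with_or)
print(jan_2014)
```

On this sample, `IN` and the `OR` version return the same order ids, and adding the `LIKE` condition with `AND` keeps only the two January 2014 orders.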

In [2]:
%%sql

USE itv002480_retail


++
||
++
++



In [3]:
%%sql

SHOW tables

+----------------+-----------+-----------+
|        database|  tableName|isTemporary|
+----------------+-----------+-----------+
|itv002480_retail|order_items|      false|
|itv002480_retail|     orders|      false|
+----------------+-----------+-----------+



In [4]:
%%sql

SELECT * FROM orders WHERE order_status = 'COMPLETE' LIMIT 10



+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|   34565|2014-02-23 00:00:...|             8702|    COMPLETE|
|   34568|2014-02-23 00:00:...|             1271|    COMPLETE|
|   34569|2014-02-23 00:00:...|            11083|    COMPLETE|
|   34576|2014-02-23 00:00:...|             6117|    COMPLETE|
|   34580|2014-02-23 00:00:...|             6540|    COMPLETE|
|   34581|2014-02-23 00:00:...|             4882|    COMPLETE|
|   34589|2014-02-23 00:00:...|               42|    COMPLETE|
|   34590|2014-02-23 00:00:...|            10367|    COMPLETE|
|   34592|2014-02-23 00:00:...|             4033|    COMPLETE|
|   34593|2014-02-23 00:00:...|             4696|    COMPLETE|
+--------+--------------------+-----------------+------------+



In [5]:
%%sql

SELECT count(1) FROM orders WHERE order_status = 'COMPLETE'

+--------+
|count(1)|
+--------+
|   22899|
+--------+



In [6]:
%%sql

SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED') LIMIT 10



+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|   34565|2014-02-23 00:00:...|             8702|    COMPLETE|
|   34568|2014-02-23 00:00:...|             1271|    COMPLETE|
|   34569|2014-02-23 00:00:...|            11083|    COMPLETE|
|   34570|2014-02-23 00:00:...|             3159|      CLOSED|
|   34571|2014-02-23 00:00:...|             4551|      CLOSED|
|   34576|2014-02-23 00:00:...|             6117|    COMPLETE|
|   34580|2014-02-23 00:00:...|             6540|    COMPLETE|
|   34581|2014-02-23 00:00:...|             4882|    COMPLETE|
|   34589|2014-02-23 00:00:...|               42|    COMPLETE|
|   34590|2014-02-23 00:00:...|            10367|    COMPLETE|
+--------+--------------------+-----------------+------------+



In [7]:
%%sql

SELECT * FROM orders WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED' LIMIT 10



+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|       1|2013-07-25 00:00:...|            11599|      CLOSED|
|       3|2013-07-25 00:00:...|            12111|    COMPLETE|
|       4|2013-07-25 00:00:...|             8827|      CLOSED|
|       5|2013-07-25 00:00:...|            11318|    COMPLETE|
|       6|2013-07-25 00:00:...|             7130|    COMPLETE|
|       7|2013-07-25 00:00:...|             4530|    COMPLETE|
|      12|2013-07-25 00:00:...|             1837|      CLOSED|
|      15|2013-07-25 00:00:...|             2568|    COMPLETE|
|      17|2013-07-25 00:00:...|             2667|    COMPLETE|
|      18|2013-07-25 00:00:...|             1205|      CLOSED|
+--------+--------------------+-----------------+------------+



In [8]:
%%sql

SELECT count(1) FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')

+--------+
|count(1)|
+--------+
|   30455|
+--------+



In [9]:
%%sql

SELECT count(1) FROM orders WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'

+--------+
|count(1)|
+--------+
|   30455|
+--------+



In [10]:
%%sql

SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
LIMIT 10



+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|   61898|2014-01-01 00:00:...|            10152|    COMPLETE|
|   61904|2014-01-01 00:00:...|              470|    COMPLETE|
|   61907|2014-01-01 00:00:...|            11258|    COMPLETE|
|   61908|2014-01-01 00:00:...|             5031|    COMPLETE|
|   61910|2014-01-01 00:00:...|            11201|    COMPLETE|
|   61913|2014-01-01 00:00:...|            12218|    COMPLETE|
|   61914|2014-01-01 00:00:...|             2956|    COMPLETE|
|   61919|2014-01-02 00:00:...|            12383|    COMPLETE|
|   61920|2014-01-02 00:00:...|             2278|    COMPLETE|
|   61921|2014-01-02 00:00:...|             3530|    COMPLETE|
+--------+--------------------+-----------------+------------+



In [11]:
%%sql

SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'

+--------+
|count(1)|
+--------+
|    2544|
+--------+



In [12]:
%%sql

SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
LIMIT 10



+--------+--------------------+-----------------+------------+
|order_id|          order_date|order_customer_id|order_status|
+--------+--------------------+-----------------+------------+
|   25882|2014-01-01 00:00:...|             4598|    COMPLETE|
|   25888|2014-01-01 00:00:...|             6735|    COMPLETE|
|   25889|2014-01-01 00:00:...|            10045|    COMPLETE|
|   25891|2014-01-01 00:00:...|             3037|      CLOSED|
|   25895|2014-01-01 00:00:...|             1044|    COMPLETE|
|   25897|2014-01-01 00:00:...|             6405|    COMPLETE|
|   25898|2014-01-01 00:00:...|             3950|    COMPLETE|
|   25899|2014-01-01 00:00:...|             8068|      CLOSED|
|   25900|2014-01-01 00:00:...|             2382|      CLOSED|
|   25901|2014-01-01 00:00:...|             3099|    COMPLETE|
+--------+--------------------+-----------------+------------+



In [13]:
%%sql

SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'

+--------+
|count(1)|
+--------+
|    2544|
+--------+
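
The two month filters above (`LIKE '2014-01%'` and `date_format(order_date, 'yyyy-MM') = '2014-01'`) return the same count. The sketch below shows why such predicates agree, again using Python's `sqlite3` as a stand-in for Spark SQL. Several things here are assumptions for illustration only: the sample rows are made up, `count_where` is a hypothetical helper, `substr(order_date, 1, 7)` stands in for `date_format` (which SQLite does not have), and the half-open range predicate is an additional variant that is not used elsewhere in this section and has not been verified against Spark here.

```python
import sqlite3

# Assumption: SQLite with TEXT dates is only a stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, order_status TEXT)"
)

# Illustrative rows chosen to sit on the month boundaries (made up).
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2013-12-31 00:00:00.0", "COMPLETE"),  # day before January 2014
        (2, "2014-01-01 00:00:00.0", "COMPLETE"),  # first day of the month
        (3, "2014-01-31 00:00:00.0", "CLOSED"),    # last day of the month
        (4, "2014-02-01 00:00:00.0", "CLOSED"),    # day after January 2014
        (5, "2014-01-15 00:00:00.0", "PENDING"),   # right month, wrong status
    ],
)


def count_where(month_predicate):
    """Hypothetical helper: count COMPLETE/CLOSED orders matching the predicate."""
    sql = (
        "SELECT count(1) FROM orders "
        "WHERE order_status IN ('COMPLETE', 'CLOSED') AND " + month_predicate
    )
    return conn.execute(sql).fetchone()[0]


# Prefix match on the textual form of the date.
by_like = count_where("order_date LIKE '2014-01%'")

# Extract the year-month part and compare it for equality.
by_substr = count_where("substr(order_date, 1, 7) = '2014-01'")

# Half-open range: from the first day of the month up to, but not including,
# the first day of the next month.
by_range = count_where(
    "order_date >= '2014-01-01' AND order_date < '2014-02-01'"
)

print(by_like, by_substr, by_range)
```

All three predicates count only the two boundary-day orders with a matching status; the December and February rows and the `PENDING` row are excluded.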



* Using Spark SQL with Python or Scala

In [None]:
spark.sql("USE itversity_retail")

In [None]:
spark.sql("SHOW tables").show()

In [None]:
spark.sql("SELECT * FROM orders WHERE order_status = 'COMPLETE'").show()

In [None]:
spark.sql("SELECT count(1) FROM orders WHERE order_status = 'COMPLETE'").show()

In [None]:
spark.sql("SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')").show()

In [None]:
spark.sql("""
SELECT * FROM orders 
WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
""").show()

In [None]:
spark.sql("""
SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
""").show()

In [None]:
spark.sql("""
SELECT count(1) FROM orders
WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
""").show()

In [None]:
spark.sql("""
SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
""").show()

In [None]:
spark.sql("""
SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
""").show()

In [None]:
spark.sql("""
SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()

In [None]:
spark.sql("""
SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()

* Let us prepare the table to demonstrate how to deal with null values while filtering the data.

In [1]:
%%sql

DROP DATABASE IF EXISTS itversity_sms CASCADE

++
||
++
++



In [2]:
%%sql

CREATE DATABASE IF NOT EXISTS itversity_sms

++
||
++
++



In [3]:
%%sql

DROP TABLE IF EXISTS students

++
||
++
++



In [4]:
%%sql

CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_number STRING,
    student_address STRING
) STORED AS avro

++
||
++
++



In [5]:
%%sql

INSERT INTO students VALUES (1, 'Scott', 'Tiger', NULL, NULL)

++
||
++
++



In [6]:
%%sql

INSERT INTO students VALUES (2, 'Donald', 'Duck', '1234567890', NULL)

++
||
++
++



In [7]:
%%sql

INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', '2345678901', 'A Street, One City, Some State, 12345'),
    (4, 'Bubble', 'Guppy', '6789012345', 'Bubbly Street, Guppy, La la land, 45678')

++
||
++
++



In [8]:
%%sql

SELECT * FROM students



+----------+------------------+-----------------+--------------------+--------------------+
|student_id|student_first_name|student_last_name|student_phone_number|     student_address|
+----------+------------------+-----------------+--------------------+--------------------+
|         2|            Donald|             Duck|          1234567890|                null|
|         1|             Scott|            Tiger|                null|                null|
|         3|            Mickey|            Mouse|          2345678901|A Street, One Cit...|
|         4|            Bubble|            Guppy|          6789012345|Bubbly Street, Gu...|
+----------+------------------+-----------------+--------------------+--------------------+



In [10]:
%%sql

SELECT * FROM students WHERE student_address IS NOT NULL

+----------+------------------+-----------------+--------------------+--------------------+
|student_id|student_first_name|student_last_name|student_phone_number|     student_address|
+----------+------------------+-----------------+--------------------+--------------------+
|         3|            Mickey|            Mouse|          2345678901|A Street, One Cit...|
|         4|            Bubble|            Guppy|          6789012345|Bubbly Street, Gu...|
+----------+------------------+-----------------+--------------------+--------------------+



* Using Spark SQL with Python or Scala

In [None]:
spark.sql("DROP DATABASE IF EXISTS itversity_sms CASCADE")

In [None]:
spark.sql("CREATE DATABASE IF NOT EXISTS itversity_sms")

In [None]:
spark.sql("DROP TABLE IF EXISTS students")

In [None]:
spark.sql("""
CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_number STRING,
    student_address STRING
) STORED AS avro
""")

In [None]:
spark.sql("""
INSERT INTO students 
VALUES (1, 'Scott', 'Tiger', NULL, NULL)
""")

In [None]:
spark.sql("""
INSERT INTO students 
VALUES (2, 'Donald', 'Duck', '1234567890', NULL)
""")

In [None]:
spark.sql("""
INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', '2345678901', 'A Street, One City, Some State, 12345'),
    (4, 'Bubble', 'Guppy', '6789012345', 'Bubbly Street, Guppy, La la land, 45678')
""")

In [None]:
spark.sql("SELECT * FROM students").show()

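* Comparison against null must be done with `IS NULL` and `IS NOT NULL`. The first two queries below, which use `= NULL` and `!= NULL`, return no rows even though one record has a null `student_phone_number`, because any comparison with `NULL` evaluates to `NULL` rather than true.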

In [None]:
spark.sql("""
SELECT * FROM students 
WHERE student_phone_number = NULL
""").show()

In [None]:
spark.sql("""
SELECT * FROM students 
WHERE student_phone_number != NULL
""").show()

In [None]:
spark.sql("""
SELECT * FROM students
WHERE student_phone_number IS NULL
""").show()

In [None]:
spark.sql("""
SELECT * FROM students
WHERE student_phone_number IS NOT NULL
""").show()
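
The null-handling rule can be checked end to end on the same four student rows. The sketch below is an illustration only: it assumes Python's built-in `sqlite3` module as a stand-in for Spark SQL (both follow the standard SQL rule that a comparison with `NULL` is never true), and `student_ids` is a hypothetical helper introduced for the example.

```python
import sqlite3

# Assumption: SQLite is only a stand-in so the sketch runs without a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        student_id INTEGER,
        student_first_name TEXT,
        student_last_name TEXT,
        student_phone_number TEXT,
        student_address TEXT
    )
""")

# Same four rows as in the INSERT statements of this section;
# Python's None is stored as SQL NULL.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Scott", "Tiger", None, None),
        (2, "Donald", "Duck", "1234567890", None),
        (3, "Mickey", "Mouse", "2345678901",
         "A Street, One City, Some State, 12345"),
        (4, "Bubble", "Guppy", "6789012345",
         "Bubbly Street, Guppy, La la land, 45678"),
    ],
)


def student_ids(predicate):
    """Hypothetical helper: sorted student_ids matching the predicate."""
    sql = f"SELECT student_id FROM students WHERE {predicate} ORDER BY student_id"
    return [row[0] for row in conn.execute(sql)]


# Comparisons with NULL evaluate to NULL, so WHERE keeps no rows at all.
equals_null = student_ids("student_phone_number = NULL")
not_equals_null = student_ids("student_phone_number != NULL")

# IS NULL / IS NOT NULL test for the presence of a value and behave as expected.
is_null = student_ids("student_phone_number IS NULL")
is_not_null = student_ids("student_phone_number IS NOT NULL")

print(equals_null, not_equals_null)
print(is_null, is_not_null)
```

Both `= NULL` and `!= NULL` yield an empty result, while `IS NULL` finds student 1 and `IS NOT NULL` finds students 2, 3, and 4.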