## Filtering Data

Let us understand how we can filter the data in Spark SQL.

In [10]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/x6sUnQ553Ow?rel=0&amp;controls=1&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [None]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Basic Transformations").
    master("yarn").
    getOrCreate

If you prefer the command line, you can launch Spark SQL using one of the following three approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* We use the `WHERE` clause to filter the data.
* Comparison operators such as `=`, `!=`, `>`, and `<` can be used to compare a column, expression, or literal with another column, expression, or literal.
* We can use `LIKE` with `%` for simple pattern matching, and `RLIKE` (or `regexp_like` in Spark versions that provide it) for regular expressions.
* Boolean `OR` and `AND` can be used when we want to apply multiple conditions.
  * Get all orders with order_status equal to COMPLETE or CLOSED. We can also use the `IN` operator.
  * Get all orders placed in January 2014 with order_status equal to COMPLETE or CLOSED.
* We need to use `IS NULL` and `IS NOT NULL` to compare against null values.
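
The points above can be tried without a cluster. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a local stand-in for Spark SQL. It is not how the notebook itself runs queries. The five sample rows and the `count` helper are made up for illustration, and only operators that behave the same way in standard SQL are used. It also shows why parentheses matter when `OR` and `AND` are mixed: `AND` binds more tightly than `OR`.

```python
import sqlite3

# Hypothetical sample rows; the real orders table lives in itversity_retail.
ORDERS = [
    (1, "2013-12-31 00:00:00.0", "COMPLETE"),
    (2, "2014-01-05 00:00:00.0", "COMPLETE"),
    (3, "2014-01-17 00:00:00.0", "CLOSED"),
    (4, "2014-01-20 00:00:00.0", "PENDING"),
    (5, "2014-02-01 00:00:00.0", "CLOSED"),
]

# In-memory SQLite database used purely as a stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INT, order_date TEXT, order_status TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)


def count(where_clause):
    """Return the number of orders matching the given WHERE clause."""
    query = f"SELECT count(1) FROM orders WHERE {where_clause}"
    return conn.execute(query).fetchone()[0]


# IN is shorthand for a chain of equality checks joined by OR.
in_count = count("order_status IN ('COMPLETE', 'CLOSED')")
or_count = count("order_status = 'COMPLETE' OR order_status = 'CLOSED'")

# Adding a second condition with AND; LIKE with % matches any suffix.
jan_count = count(
    "order_status IN ('COMPLETE', 'CLOSED') AND order_date LIKE '2014-01%'"
)

# With OR, parentheses are needed to get the same result as IN ... AND ...
paren_count = count(
    "(order_status = 'COMPLETE' OR order_status = 'CLOSED') "
    "AND order_date LIKE '2014-01%'"
)

# Without parentheses this reads as: COMPLETE OR (CLOSED AND January 2014).
no_paren_count = count(
    "order_status = 'COMPLETE' OR order_status = 'CLOSED' "
    "AND order_date LIKE '2014-01%'"
)

print(in_count, or_count, jan_count, paren_count, no_paren_count)
```

On this sample data the `IN` and `OR` forms return the same count, while the unparenthesized `OR ... AND ...` form also picks up the COMPLETE order from December 2013.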

In [None]:
%%sql

USE itversity_retail

In [None]:
%%sql

SHOW tables

In [None]:
%%sql

SELECT * FROM orders WHERE order_status = 'COMPLETE' LIMIT 10

In [None]:
%%sql

SELECT count(1) FROM orders WHERE order_status = 'COMPLETE'

In [None]:
%%sql

SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED') LIMIT 10

In [None]:
%%sql

SELECT * FROM orders WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED' LIMIT 10

In [None]:
%%sql

SELECT count(1) FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')

In [None]:
%%sql

SELECT count(1) FROM orders WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'

In [None]:
%%sql

SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
LIMIT 10

In [None]:
%%sql

SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'

In [None]:
%%sql

SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
LIMIT 10

In [None]:
%%sql

SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'

* Using Spark SQL with Python or Scala

In [None]:
spark.sql("USE itversity_retail")

In [None]:
spark.sql("SHOW tables").show()

In [None]:
spark.sql("SELECT * FROM orders WHERE order_status = 'COMPLETE'").show()

In [None]:
spark.sql("SELECT count(1) FROM orders WHERE order_status = 'COMPLETE'").show()

In [None]:
spark.sql("SELECT * FROM orders WHERE order_status IN ('COMPLETE', 'CLOSED')").show()

In [None]:
spark.sql("""
SELECT * FROM orders 
WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
""").show()

In [None]:
spark.sql("""
SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
""").show()

In [None]:
spark.sql("""
SELECT count(1) FROM orders
WHERE order_status = 'COMPLETE' OR order_status = 'CLOSED'
""").show()

In [None]:
spark.sql("""
SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
""").show()

In [None]:
spark.sql("""
SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND order_date LIKE '2014-01%'
""").show()

In [None]:
spark.sql("""
SELECT * FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()

In [None]:
spark.sql("""
SELECT count(1) FROM orders 
WHERE order_status IN ('COMPLETE', 'CLOSED')
    AND date_format(order_date, 'yyyy-MM') = '2014-01'
""").show()

* Let us prepare the table to demonstrate how to deal with null values while filtering the data.

In [None]:
%%sql

DROP DATABASE IF EXISTS itversity_sms CASCADE

In [None]:
%%sql

CREATE DATABASE IF NOT EXISTS itversity_sms

In [None]:
%%sql

USE itversity_sms

In [None]:
%%sql

DROP TABLE IF EXISTS students

In [None]:
%%sql

CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_number STRING,
    student_address STRING
) STORED AS avro

In [None]:
%%sql

INSERT INTO students VALUES (1, 'Scott', 'Tiger', NULL, NULL)

In [None]:
%%sql

INSERT INTO students VALUES (2, 'Donald', 'Duck', '1234567890', NULL)

In [None]:
%%sql

INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', '2345678901', 'A Street, One City, Some State, 12345'),
    (4, 'Bubble', 'Guppy', '6789012345', 'Bubbly Street, Guppy, La la land, 45678')

In [None]:
%%sql

SELECT * FROM students

* Using Spark SQL with Python or Scala

In [None]:
spark.sql("DROP DATABASE IF EXISTS itversity_sms CASCADE")

In [None]:
spark.sql("CREATE DATABASE IF NOT EXISTS itversity_sms")

In [None]:
spark.sql("USE itversity_sms")

In [None]:
spark.sql("DROP TABLE IF EXISTS students")

In [None]:
spark.sql("""
CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_number STRING,
    student_address STRING
) STORED AS avro
""")

In [None]:
spark.sql("""
INSERT INTO students 
VALUES (1, 'Scott', 'Tiger', NULL, NULL)
""")

In [None]:
spark.sql("""
INSERT INTO students 
VALUES (2, 'Donald', 'Duck', '1234567890', NULL)
""")

In [None]:
spark.sql("""
INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', '2345678901', 'A Street, One City, Some State, 12345'),
    (4, 'Bubble', 'Guppy', '6789012345', 'Bubbly Street, Guppy, La la land, 45678')
""")

In [None]:
spark.sql("SELECT * FROM students").show()

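* Comparison against null has to be done with `IS NULL` and `IS NOT NULL`. The first two queries below, which use `= NULL` and `!= NULL`, return no rows even though one record has a null student_phone_number, because any comparison with null evaluates to null rather than true.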
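
The null behavior can also be checked locally. This is a minimal sketch that again uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, on the assumption that both follow standard SQL null semantics for these four predicates. The rows are the same four students inserted earlier. The `matching_ids` helper is hypothetical and exists only for this illustration.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE students (
    student_id INT,
    student_first_name TEXT,
    student_last_name TEXT,
    student_phone_number TEXT,
    student_address TEXT
)
""")
# Python None is stored as SQL NULL.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Scott", "Tiger", None, None),
        (2, "Donald", "Duck", "1234567890", None),
        (3, "Mickey", "Mouse", "2345678901",
         "A Street, One City, Some State, 12345"),
        (4, "Bubble", "Guppy", "6789012345",
         "Bubbly Street, Guppy, La la land, 45678"),
    ],
)


def matching_ids(predicate):
    """Return the student_ids of rows for which the predicate is true."""
    query = (
        f"SELECT student_id FROM students WHERE {predicate} "
        "ORDER BY student_id"
    )
    return [row[0] for row in conn.execute(query)]


# A comparison with NULL yields NULL, and WHERE keeps only rows
# whose predicate is true, so neither query returns anything.
eq_null = matching_ids("student_phone_number = NULL")
ne_null = matching_ids("student_phone_number != NULL")

# IS NULL and IS NOT NULL always evaluate to true or false.
is_null = matching_ids("student_phone_number IS NULL")
is_not_null = matching_ids("student_phone_number IS NOT NULL")

# The comparison itself comes back as NULL (None in Python).
null_cmp = conn.execute("SELECT NULL = NULL").fetchone()[0]

print(eq_null, ne_null, is_null, is_not_null, null_cmp)
```

Both `= NULL` and `!= NULL` match nothing, so the two results do not add up to the full table. `IS NULL` and `IS NOT NULL` together cover every row.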

In [None]:
spark.sql("""
SELECT * FROM students 
WHERE student_phone_number = NULL
""").show()

In [None]:
spark.sql("""
SELECT * FROM students 
WHERE student_phone_number != NULL
""").show()

In [None]:
spark.sql("""
SELECT * FROM students
WHERE student_phone_number IS NULL
""").show()

In [None]:
spark.sql("""
SELECT * FROM students
WHERE student_phone_number IS NOT NULL
""").show()