## Lecture 3 Filtering data

### The `WHERE` clause


```MySQL
SELECT ... FROM tablename WHERE [...];
```

The `WHERE` clause is used to filter rows (NOT columns) according to one or more criteria.


#### Included in results
    [...] -> true
#### Excluded from results
    [...] -> false
    [...] -> NULL
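
The three outcomes can be reproduced with a small sketch. It uses Python's built-in `sqlite3` as a stand-in engine (an assumption made so the example runs without a cluster; it is not Hive or Impala), and the `games` rows are made up for illustration.

```python
import sqlite3

# Hypothetical data; SQLite stands in for Hive/Impala here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (name TEXT, list_price REAL)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?)",
    [("Cheap", 5.0), ("Pricey", 25.0), ("Unknown", None)],
)

# The condition evaluates to true, false, and NULL for the three rows.
kept = [row[0] for row in conn.execute(
    "SELECT name FROM games WHERE list_price < 10 ORDER BY name")]

# Negating the condition still does not bring back the NULL row.
kept_negated = [row[0] for row in conn.execute(
    "SELECT name FROM games WHERE NOT (list_price < 10) ORDER BY name")]

print(kept, kept_negated)
```

The row with the missing price appears in neither result, because `NOT NULL` is still `NULL`.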


### Using expressions in the `WHERE` clause

- When a SQL engine evaluates an expression, it produces a column of values. 

- When you use expressions in the `SELECT` list, that column of values will be returned in the result set.

- When you use expressions in the `WHERE` clause, that column is used to determine which **rows** are returned.
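
A minimal sketch of the difference, again using `sqlite3` as a stand-in engine with made-up prices (both are assumptions for illustration). The same expression is placed first in the `SELECT` list and then in the `WHERE` clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (name TEXT, list_price REAL)")
conn.executemany("INSERT INTO games VALUES (?, ?)",
                 [("Clue", 9.99), ("Monopoly", 19.99), ("Risk", 29.99)])

# In the SELECT list, the expression becomes a column of the result.
as_column = conn.execute(
    "SELECT name, list_price < 10 FROM games ORDER BY name").fetchall()

# In the WHERE clause, the same expression decides which rows survive.
as_filter = conn.execute(
    "SELECT name FROM games WHERE list_price < 10 ORDER BY name").fetchall()

print(as_column)
print(as_filter)
```

SQLite reports Boolean results as `1` and `0`, whereas Impala prints `true` and `false`; the per-row evaluation is the point here.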

Some Examples:

Q: Which games are priced below ten dollars?
```MySQL
SELECT name, list_price FROM fun.games WHERE list_price < 10;
```

Q: Which games were invented by Elizabeth Magie?

```MySQL
SELECT name FROM fun.games WHERE inventor='Elizabeth Magie';
```

Q: Which games are suitable for a child age 7?
```MySQL
SELECT name, min_age FROM fun.games WHERE min_age <= 7 AND max_age >= 7;
```


In practice, a data analyst has to translate each question into a Boolean expression before writing the query. Questions are often ambiguous, so you need to ask for clarification or make reasonable assumptions, and you should state and justify those assumptions when you present your results.

#### Comparison operators

- `=` (a single equals sign): equality comparison
- `!=` or `<>`: not-equal comparison
- `<`, `<=`, `>`, `>=`: inequality comparisons

```shell
$ impala-shell --quiet -q 'SELECT color, red + green + blue AS dark FROM wax.crayons WHERE red + green + blue <= 325'
Starting Impala Shell without Kerberos authentication
+----------------------+------+
| color                | dark |
+----------------------+------+
| Black                | 105  |
| Eggplant             | 287  |
| Electric Lime        | 298  |
| Green                | 320  |
| Midnight Blue        | 216  |
| Outer Space          | 215  |
| Pine Green           | 269  |
| Tropical Rain Forest | 260  |
+----------------------+------+
```

#### Data types and precision

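Some experiments. Note that in the second and third commands the inner single quotes are consumed by the shell, so Impala receives `1=1.0` and `1=1` (as the column headers show) and no string is actually compared.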

```
[training@localhost ~]$ impala-shell --quiet -q 'select 1=1.0'
Starting Impala Shell without Kerberos authentication
+---------+
| 1 = 1.0 |
+---------+
| true    |
+---------+
[training@localhost ~]$ impala-shell --quiet -q 'select '1'=1.0'
Starting Impala Shell without Kerberos authentication
+---------+
| 1 = 1.0 |
+---------+
| true    |
+---------+
[training@localhost ~]$ impala-shell --quiet -q 'select '1'=1'
Starting Impala Shell without Kerberos authentication
+-------+
| 1 = 1 |
+-------+
| true  |
+-------+
[training@localhost ~]$ impala-shell --quiet -q 'select 1/3 = 0.3333'
Starting Impala Shell without Kerberos authentication
+----------------+
| 1 / 3 = 0.3333 |
+----------------+
| false          |
+----------------+
[training@localhost ~]$ impala-shell --quiet -q 'select 1/3 = 0.3333333333333333333'
Starting Impala Shell without Kerberos authentication
+-------------------------------+
| 1 / 3 = 0.3333333333333333333 |
+-------------------------------+
| true                          |
+-------------------------------+
[training@localhost ~]$ impala-shell --quiet -q 'select round(1/3, 4) = 0.3333 -- better comparison'
Starting Impala Shell without Kerberos authentication
+--------------------------+
| round(1 / 3, 4) = 0.3333 |
+--------------------------+
| true                     |
+--------------------------+
```
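
The same pitfall shows up in application code. The sketch below uses plain Python floats, not Impala; Impala's numeric types for `1/3` and for decimal literals may behave differently in detail, so treat it only as an illustration of comparing at a chosen precision or within a tolerance.

```python
import math

third = 1 / 3  # a 64-bit float, not an exact fraction

# Exact equality fails because the values differ past the 4th decimal place.
exact_match = third == 0.3333

# Rounding both sides to the same precision makes the comparison meaningful.
rounded_match = round(third, 4) == 0.3333

# A tolerance-based comparison is another option.
close_enough = math.isclose(third, 0.3333, abs_tol=1e-4)

print(exact_match, rounded_match, close_enough)
```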

#### Logical operators `AND, OR, NOT`

```MySQL
SELECT name, min_age FROM fun.games WHERE min_age <= 7 AND max_age >= 7;
SELECT * FROM fun.games WHERE list_price <= 10 OR name = 'Monopoly';
SELECT * FROM fun.games WHERE name != 'Risk';
SELECT * FROM fun.games WHERE NOT name = 'Risk';
```

Order of operations: `parentheses > NOT > AND > OR`

```MySQL
SELECT * FROM fun.games
  WHERE (list_price <= 10 OR name = 'Monopoly') AND max_players >= 6;
```
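
To see why the parentheses matter, here is a sketch that runs the condition with and without them. It assumes `sqlite3` as a stand-in engine (its precedence for `NOT`, `AND`, and `OR` matches the order above) and uses three made-up games; `names` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (name TEXT, list_price REAL, max_players INTEGER)")
conn.executemany("INSERT INTO games VALUES (?, ?, ?)", [
    ("Cheap Duel", 8.0, 2),    # cheap, few players
    ("Cheap Party", 9.0, 8),   # cheap, many players
    ("Monopoly", 20.0, 6),     # not cheap, many players
])

def names(where):
    """Return the sorted game names matching a WHERE condition."""
    sql = f"SELECT name FROM games WHERE {where} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

# AND binds tighter than OR, so this reads as: cheap OR (Monopoly AND >= 6).
without_parens = names(
    "list_price <= 10 OR name = 'Monopoly' AND max_players >= 6")

# Parentheses force: (cheap OR Monopoly) AND >= 6.
with_parens = names(
    "(list_price <= 10 OR name = 'Monopoly') AND max_players >= 6")

print(without_parens)
print(with_parens)
```

Without parentheses the two-player game slips through, because the player-count test only applies to the `Monopoly` branch.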

#### Other operators `IN, BETWEEN`

`IN`
```MySQL
SELECT * FROM fun.games
  WHERE name IN ('Monopoly', 'Clue', 'Risk'); -- concise and readable expression using the IN keyword
```
---
```shell
$ impala-shell --quiet -q "SELECT pack FROM wax.crayons WHERE color IN ('Plum', 'Salmon', 'Vivid Tangerine')"
```
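
`IN` is shorthand for a chain of `OR`-ed equality tests. The sketch below checks that equivalence on a made-up table, with `sqlite3` assumed as a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (name TEXT)")
conn.executemany("INSERT INTO games VALUES (?)",
                 [("Monopoly",), ("Clue",), ("Risk",), ("Scrabble",)])

with_in = conn.execute(
    "SELECT name FROM games "
    "WHERE name IN ('Monopoly', 'Clue', 'Risk') ORDER BY name"
).fetchall()

with_or = conn.execute(
    "SELECT name FROM games "
    "WHERE name = 'Monopoly' OR name = 'Clue' OR name = 'Risk' ORDER BY name"
).fetchall()

print(with_in == with_or, with_in)
```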

---

`BETWEEN ... AND ...` (closed interval: both endpoints are included)
```MySQL
SELECT * FROM fun.games WHERE min_age BETWEEN 8 AND 10;
SELECT 1 / 3 BETWEEN 0.3332 AND 0.3334; -- work around numerical precision using BETWEEN ... AND ...
```
---
```SHELL
$ impala-shell --quiet -q "SELECT 1/3 BETWEEN 0.3332 AND 0.3334"
+---------------------------------+
| 1 / 3 between 0.3332 and 0.3334 |
+---------------------------------+
| true                            |
+---------------------------------+

```
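
A quick check of the closed-interval behaviour, sketched with `sqlite3` as a stand-in engine and made-up rows (both assumptions): `BETWEEN 8 AND 10` should match the same rows as `>= 8 AND <= 10`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (name TEXT, min_age INTEGER)")
conn.executemany("INSERT INTO games VALUES (?, ?)",
                 [("A", 7), ("B", 8), ("C", 9), ("D", 10), ("E", 11)])

between = [row[0] for row in conn.execute(
    "SELECT name FROM games WHERE min_age BETWEEN 8 AND 10 ORDER BY name")]

explicit = [row[0] for row in conn.execute(
    "SELECT name FROM games "
    "WHERE min_age >= 8 AND min_age <= 10 ORDER BY name")]

print(between, explicit)
```

Both endpoints (8 and 10) are included; 7 and 11 are not.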

### Working with missing values

#### Understanding missing values

Q: Why should we pay attention to missing values (`NULL`)? <br>
A: Rows for which the Boolean expression evaluates to `NULL` are omitted from the result, just like rows for which it evaluates to `false`.

So we should be careful about how we interpret query results.

```MySQL
SELECT * FROM fun.inventory WHERE price < 10;
```

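The query returns only 1 row, which means that *at least* 1 row has a price below 10. Rows whose `price` is `NULL` are also left out, so other items may cost less than 10 without appearing in the result.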
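
One way to avoid over-reading such a result is to count the rows in each of the three groups. The sketch below does this on a made-up `inventory` table, with `sqlite3` assumed as a stand-in engine; `count` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (shop TEXT, game TEXT, price REAL)")
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)", [
    ("Shop A", "Clue", 9.99),
    ("Shop A", "Risk", 25.00),
    ("Shop B", "Clue", None),      # price not recorded
    ("Shop B", "Monopoly", None),  # price not recorded
])

def count(where):
    """Count the inventory rows for which the condition is true."""
    sql = f"SELECT count(*) FROM inventory WHERE {where}"
    return conn.execute(sql).fetchone()[0]

below = count("price < 10")
not_below = count("price >= 10")
unknown = count("price IS NULL")
total = conn.execute("SELECT count(*) FROM inventory").fetchone()[0]

print(below, not_below, unknown, total)
```

The first two counts do not add up to the total; the difference is exactly the number of rows with a missing price.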


#### Handling missing values
You cannot test for `NULL` using standard comparison operators `=, <, >, ...`. Any value compared to `NULL` evaluates to `NULL`.

*operators*
- Use `IS NULL` to check for `NULL` values; use `IS NOT NULL` to check for non-NULL values
```MySQL
SELECT * FROM fun.inventory WHERE price IS NULL;
SELECT * FROM fun.inventory WHERE price IS NOT NULL;
```
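- Use `IS DISTINCT FROM` and `IS NOT DISTINCT FROM` for NULL-safe comparisons; `<=>` is shorthand for `IS NOT DISTINCT FROM`. These always return `true` or `false`, never `NULL`.


      NULL IS DISTINCT FROM non-NULL -> true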
```SHell
$ impala-shell --quiet -q "SELECT * FROM default.offices WHERE state_province IS DISTINCT FROM 'Illinois'"
+-----------+-----------+----------------+---------+
| office_id | city      | state_province | country |
+-----------+-----------+----------------+---------+
| a         | Istanbul  | Istanbul       | tr      |
| c         | Rosario   | Santa Fe       | ar      |
| d         | Singapore | NULL           | sg      |
+-----------+-----------+----------------+---------+
```     
      NULL != non-NULL value -> NULL


```Shell
$ impala-shell --quiet -q "SELECT * FROM default.offices WHERE state_province != 'Illinois'"
+-----------+----------+----------------+---------+
| office_id | city     | state_province | country |
+-----------+----------+----------------+---------+
| a         | Istanbul | Istanbul       | tr      |
| c         | Rosario  | Santa Fe       | ar      |
+-----------+----------+----------------+---------+

```
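
The difference between the two query outputs above can be modelled in a few lines of Python, with `None` playing the role of `NULL`. This is only a sketch of the semantics: `sql_not_equal` and `is_distinct_from` are hypothetical helpers, and row `b` is made up (the real table's remaining rows are not shown above).

```python
def sql_not_equal(a, b):
    """Model SQL `a != b`: NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b

def is_distinct_from(a, b):
    """Model SQL `a IS DISTINCT FROM b`: always True or False."""
    if a is None or b is None:
        # Distinct unless both sides are NULL.
        return (a is None) != (b is None)
    return a != b

# (office_id, state_province); row "b" is hypothetical.
offices = [("a", "Istanbul"), ("b", "Illinois"), ("c", "Santa Fe"), ("d", None)]

# WHERE keeps a row only when the condition is exactly true.
ne_ids = [oid for oid, state in offices
          if sql_not_equal(state, "Illinois") is True]
distinct_ids = [oid for oid, state in offices
                if is_distinct_from(state, "Illinois")]

print(ne_ids, distinct_ids)
```

Under this model, `a <=> b` corresponds to `not is_distinct_from(a, b)`, which is why it treats two `NULL`s as equal.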

Q: Which statements return all rows in `fly.flights` in which `dep_delay` and `arr_delay` are equal or both missing? <br>
A:
```MySQL
SELECT * FROM fly.flights WHERE dep_delay = arr_delay OR (dep_delay is NULL AND arr_delay is NULL);
SELECT * FROM fly.flights WHERE dep_delay <=> arr_delay;
SELECT * FROM fly.flights WHERE dep_delay IS NOT DISTINCT FROM arr_delay;
```

#### Conditional functions

- `if`

```MySql
SELECT shop, game, if(price IS NULL, 8.99, price) AS correct_price
  FROM fun.inventory;
  
SELECT shop, game, if(price > 10, 'high price', 'low or missing price') AS price_category
  FROM fun.inventory;
```

- `case`

```MySql
SELECT shop, game, price,
      CASE WHEN price IS NULL THEN
                'missing price'
           WHEN price > 10 THEN
                'high price'
           ELSE 'low price'
      END AS price_category
  FROM fun.inventory;
```
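
The order of the `WHEN` branches matters when `NULL` is involved. The sketch below uses `sqlite3` as a stand-in engine with made-up rows (both assumptions), and writes the two-way split as a `CASE` expression because SQLite does not provide a function named `if`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (game TEXT, price REAL)")
conn.executemany("INSERT INTO inventory VALUES (?, ?)",
                 [("Clue", 9.99), ("Risk", 25.00), ("Monopoly", None)])

# Two-way split: a NULL price makes `price > 10` NULL, so it falls to ELSE.
two_way = dict(conn.execute(
    "SELECT game, CASE WHEN price > 10 THEN 'high price' "
    "ELSE 'low or missing price' END FROM inventory"))

# Three-way split: test for NULL first so missing prices get their own label.
three_way = dict(conn.execute(
    "SELECT game, CASE WHEN price IS NULL THEN 'missing price' "
    "WHEN price > 10 THEN 'high price' ELSE 'low price' END FROM inventory"))

print(two_way)
print(three_way)
```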

- `nullif`
    Returns `NULL` when its two arguments are equal. It can guard against invalid operations, such as division by `0`.

```SQL
SELECT distance / nullif(air_time, 0) * 60 AS avg_speed FROM fly.flights;
SELECT distance / if(air_time = 0, NULL, air_time) * 60 AS avg_speed FROM fly.flights;
SELECT distance / (CASE WHEN air_time = 0 THEN NULL ELSE air_time END) * 60 AS avg_speed FROM fly.flights;
```
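
A small sketch of what `nullif` returns row by row, using `sqlite3` as a stand-in engine with made-up flights (both assumptions). SQLite happens to return `NULL` for division by zero on its own, so the column to look at is the output of `nullif` itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (distance REAL, air_time REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?)",
                 [(1200.0, 120.0), (500.0, 0.0), (800.0, None)])

# Columns: raw air_time, air_time with 0 turned into NULL, average speed.
rows = conn.execute(
    "SELECT air_time, nullif(air_time, 0), "
    "distance / nullif(air_time, 0) * 60 "
    "FROM flights ORDER BY rowid").fetchall()

for row in rows:
    print(row)
```

A zero `air_time` becomes `NULL` before the division, so the speed for that row is `NULL` rather than the result of dividing by zero.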

- `ifnull`
    Fills in missing values with a replacement value.
    
```SQL
SELECT ifnull(air_time, 340) AS air_time_no_nulls FROM fly.flights WHERE origin='EWR' AND dest='SFO';
```

- `coalesce`: returns the first non-NULL argument

```SQL
SELECT coalesce(arr_time, sched_arr_time) AS real_or_sched_arr_time FROM fly.flights;
```
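
Both functions can be tried on a couple of made-up rows. The sketch assumes `sqlite3` as a stand-in engine, which also provides `ifnull` and `coalesce`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights "
    "(air_time INTEGER, arr_time INTEGER, sched_arr_time INTEGER)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)",
                 [(345, 1130, 1125),     # all values recorded
                  (None, None, 1400)])   # only the schedule is known

filled = conn.execute(
    "SELECT ifnull(air_time, 340), coalesce(arr_time, sched_arr_time) "
    "FROM flights ORDER BY rowid").fetchall()

print(filled)
```

Recorded values pass through unchanged; only the missing ones are replaced.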

### Using Hive and Impala Scripts

#### Variable substitution

##### Beeline
```SQL
-- hivesql.sql
-- set a variable containing the name of the game
SET hivevar:game=Monopoly;
```

```SQL
-- return the prices of the game at game shops
SELECT shop, price FROM fun.inventory WHERE game='${hivevar:game}';
```

```SQL
-- return the list price of the game
SELECT list_price FROM fun.games WHERE name='${hivevar:game}';
```

---
```Shell
$ beeline --silent=true -u jdbc:hive2://localhost:10000 -f hivesql.sql
```
---

```SQL
-- hivesql2.sql
-- variable should be passed in command line
SELECT hex FROM wax.crayons WHERE color='${hivevar:color}';
```

---
```Shell
$ beeline --silent=true -u jdbc:hive2://localhost:10000 --hivevar color="Red" -f hivesql2.sql
```
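
Variable substitution is textual: the placeholder is replaced by the value before the statement is parsed, which is why the SQL needs its own quotes around `${hivevar:color}`. The sketch below is a simplified model of that step; `substitute_hivevars` is a hypothetical helper, not part of Hive or Beeline.

```python
import re

def substitute_hivevars(sql, variables):
    """Replace each ${hivevar:name} with its value, as plain text."""
    def lookup(match):
        return variables[match.group(1)]
    return re.sub(r"\$\{hivevar:(\w+)\}", lookup, sql)

template = "SELECT hex FROM wax.crayons WHERE color='${hivevar:color}';"
expanded = substitute_hivevars(template, {"color": "Red"})

print(expanded)
```

Without the quotes in the template, the expanded statement would contain a bare `Red`, which the engine would try to read as an identifier rather than a string.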

##### Impala

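In Impala Shell, use `var` instead of `hivevar`: pass the value with `--var` on the command line and reference it as `${var:name}` in the query.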

Quiz: Which commands correctly pass a string parameter called `month` to a Beeline or Impala Shell query that runs the file `report.sql`?

Answer: A and B

A. 

```shell
$ beeline -u jdbc:hive2://localhost:10000 --hivevar month="January" -f report.sql
```

B. 
```shell
$ impala-shell --var month="January" -f report.sql
```

#### Calling Beeline and Impala from shell scripts

```bash
#!/bin/bash
impala-shell \
    --quiet \
    --delimited \
    --output_delimiter=',' \
    --print_header \
    -q 'SELECT * FROM fly.flights WHERE air_time=0;' \
    -o zero_air_time.csv
mail \
    -a zero_air_time.csv \
    -s 'Flight with zero air_time' \
    fly@example.com \
    <<< 'Do you know why air_time is zero in these rows?'
```
---

```shell
$ chmod 775 email_results.sh
$ ./email_results.sh
```
---

```python
# python
import subprocess
subprocess.call(['./email_results.sh'])
```
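
`subprocess.call` returns the exit status, which is easy to ignore by accident. A possible variant, sketched below, uses `subprocess.run` with `check=True` so that a failing script raises an exception. `run_script` is a hypothetical helper, and a stand-in command replaces `./email_results.sh` so the sketch runs anywhere.

```python
import subprocess
import sys

def run_script(command):
    """Run a command; raise CalledProcessError on a nonzero exit status."""
    completed = subprocess.run(
        command, check=True, capture_output=True, text=True)
    return completed.stdout

# Stand-in command; in practice this would be ['./email_results.sh'].
output = run_script([sys.executable, "-c", "print('sent')"])
print(output.strip())
```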

#### Querying from applications

```python
from impala.dbapi import connect

conn = connect(host='localhost', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM fun.games')
results = cursor.fetchall()
for row in results:
    print(row)
```