JupyterLab allows writing SQL queries directly in a cell, thanks to the `%sparksql` magic command (use two percent signs, `%%sparksql`, for a query that spans multiple lines). PySpark can also interact with this environment, which means local files can be loaded with PySpark and then queried as Hive tables.

In [1]:
import findspark; findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)
spark.conf.set('spark.sql.repl.eagerEval.truncate', 80)
spark.conf.set('spark.sql.repl.eagerEval.maxNumRows', 20)

import pyspark.sql.functions as F
import pyspark.sql.types as T

In [2]:
%load_ext sparksql_magic

In [4]:
%config SparkSql.limit=20

# 1. Managing tables

## 1.1. Creating tables

#### Manually creating tables

In [12]:
%%sparksql

CREATE EXTERNAL TABLE IF NOT EXISTS tbl_product
STORED AS PARQUET
LOCATION 'spark_db/tbl_product'
TBLPROPERTIES ('parquet.compression'='snappy')
AS
SELECT *
FROM VALUES
    ('Laptop', 1000, 15),
    ('Mouse', 20, 100),
    ('Headphone', 50, 50),
    ('USB', NULL, 100)
AS (product, price, stock)

#### Metadata

|Statement|Usage|
|:--|:--|
|`DESC <table_name>`|Show columns and comments of a table|
|`DESC FORMATTED <table_name>`|Show detailed information of a table|
|`SHOW CREATE TABLE <table_name>`|Get the script that created the table|
|`DROP TABLE IF EXISTS <table_name>`|Drop a table|
|`SHOW DATABASES`|Show all available databases|
|`DESC DATABASE EXTENDED <database_name>`|Show information about a database|
|`USE <database_name>`|Enter a specific database|
|`SHOW TABLES`|Show all available tables and views|
|`SHOW TABLES LIKE <pattern>`|Show all tables whose names match a pattern|

In [4]:
%%sparksql

DESC tbl_product

|col_name|data_type|comment|
|:--|:--|:--|
|product|string||
|price|int||
|stock|int||


In [3]:
%%sparksql

SHOW TABLES LIKE '*product*'

|database|tableName|isTemporary|
|:--|:--|:--|
|default|tbl_product|False|


In [4]:
%%sparksql

ALTER TABLE tbl_product SET TBLPROPERTIES('external'='false', 'auto.purge'='true')

In [16]:
%%sparksql

DROP TABLE IF EXISTS tbl_product

#### Partitioning
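Like Hive, Spark SQL can organize a table into partitions. Each partition is stored in its own directory, so queries that filter on the partition columns only read the matching partitions, and partitions can be processed in parallel. One or two categorical columns with few distinct values are typically used as partition columns. Data can be inserted into a partition using `INSERT INTO TABLE` (append) or `INSERT OVERWRITE TABLE` (replace).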

In [17]:
%%sparksql

CREATE TABLE IF NOT EXISTS tbl_product (
    product STRING COMMENT 'name of product',
    price INT COMMENT 'price of product',
    stock INT COMMENT 'number of products left'
)
PARTITIONED BY (day STRING COMMENT 'day', hour STRING COMMENT 'hour')
STORED AS PARQUET
LOCATION 'spark_db/tbl_product'
TBLPROPERTIES ('parquet.compression'='snappy')

In [5]:
%%sparksql

INSERT OVERWRITE TABLE tbl_product
PARTITION (day=20210725, hour=14)
VALUES
    ('Laptop', 1000, 25),
    ('Mouse', 30, 100),
    ('Headphone', 50, 50)

In [6]:
%%sparksql

INSERT INTO TABLE tbl_product
PARTITION (day=20210725, hour=21)
VALUES
    ('Laptop', 1000, 20),
    ('Mouse', 20, 97),
    ('Headphone', 65, 12)

In [7]:
%%sparksql

SHOW PARTITIONS tbl_product

|partition|
|:--|
|day=20210725/hour=14|
|day=20210725/hour=21|


In [22]:
%%sparksql

ALTER TABLE tbl_product DROP IF EXISTS PARTITION (day=20210725)

## 1.2. Importing local files

In [5]:
df = spark.read.csv('data/youtube_trending.csv', header=True, inferSchema=True)

df\
    .write.format('parquet')\
    .option('path', 'spark_db/tbl_youtube')\
    .option('compression', 'snappy')\
    .mode('overwrite').saveAsTable('tbl_youtube')

In [20]:
%%sparksql

SELECT * FROM tbl_youtube LIMIT 5

|video_id|trending_date|channel_title|category_id|publish_time|views|likes|dislikes|comment_count|comments_disabled|ratings_disabled|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|2kyS6SvSYSE|2017-11-14|CaseyNeistat|22|2017-11-14 00:13:01|748374|57527|2966|15954|False|False|
|1ZAPwfrtAFY|2017-11-14|LastWeekTonight|24|2017-11-13 14:30:00|2418783|97185|6146|12703|False|False|
|5qpjK5DgCt4|2017-11-14|Rudy Mancuso|23|2017-11-13 02:05:24|3191434|146033|5339|8181|False|False|
|puqaWrEC7tY|2017-11-14|Good Mythical Morning|24|2017-11-13 18:00:04|343168|10172|666|2146|False|False|
|d380meD0W0M|2017-11-14|nigahiga|24|2017-11-13 01:01:41|2095731|132235|1989|17518|False|False|


## 1.3. Hive data types
Data types and converting between them, especially date and time types.

## 1.4. Aliases
- Table aliases
- Column aliases
- Use backticks to quote names:

```sql
`schema`.`database`.`table`.`column`
```

# 2. Data manipulation

## 2.1. Filtering
```sql
RLIKE | LIKE | IN | IS NULL
```
Note: combine conditions with `AND` and `OR`, and negate them with `NOT`.

## 2.2. Aggregating

```sql
GROUP BY: COUNT(*), COUNT(DISTINCT), COUNT(CASE WHEN), SUM, SUM(DISTINCT), AVG, AVG(DISTINCT), MIN, MAX, VAR_POP, STDDEV_POP, PERCENTILE
```

*Reference: [Apache Hive - Aggregate functions](https://cwiki.apache.org/confluence/display/hive/languagemanual+udf#LanguageManualUDF-Built-inAggregateFunctions(UDAF))*

## 2.3. Window functions

- Window functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE,... (same as in PySpark)
- Special uses of SUM, COUNT (with or without PARTITION BY)
- ROWS|RANGE BETWEEN, UNBOUNDED, PRECEDING, CURRENT ROW,...

*Reference: [Apache Hive - Window functions](https://cwiki.apache.org/confluence/display/hive/languagemanual+windowingandanalytics)*

## 2.3. Gathering data
- `JOIN`: cross, left, right, inner, outer
- `UNION`: distinct (default), all

## 2.4. Order of execution
```sql
FROM -> JOIN [ON] -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY -> LIMIT
```

## 2.5. Functions
```sql
PI, RAND, LOG, SQRT, POW/POWER, CONCAT, CONCAT_WS, NVL, NVL2, REGEXP_REPLACE, REGEXP_EXTRACT, SPLIT, GREATEST, LEAST, LOWER/UPPER, LENGTH, NULLIF, LPAD/RPAD, LTRIM/RTRIM/TRIM, SUBSTR/SUBSTRING, CASE WHEN
```

*Reference: [Apache Hive - Built-in functions](https://cwiki.apache.org/confluence/display/hive/languagemanual+udf#LanguageManualUDF-Built-inFunctions)*

# 3. Data structures

## 3.1. Array type
- Definition: an ordered collection of elements of the same type, similar to a one-dimensional NumPy array
- Schema: `ARRAY<STRING>`
- Inserting: `ARRAY('hung', 'linh',...)`
- Accessing: `A[1]` (indexing starts at 0)
- Techniques:
    - Unpacking: `LATERAL VIEW, EXPLODE, POSEXPLODE, INLINE,...`
    - Higher order functions: `TRANSFORM, FILTER, EXISTS, AGGREGATE` ([read more](https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html))
    - Basic functions: `SIZE, ARRAY_CONTAINS, SORT_ARRAY, CONCAT_WS, SEQUENCE,...` (same as in PySpark)

## 3.2. Struct type
- Definition: a record with named fields; an array of structs is like two or more arrays zipped together
- Schema: `ARRAY<STRUCT<id:INT, name:STRING, interest:STRING>>`
- Inserting: `ARRAY(STRUCT(0, 'hung', 'buom'), STRUCT(1, 'linh', 'chim'))`
- Accessing: `S.id, S.name, S.interest`
- Techniques: `ARRAYS_ZIP`

## 3.3. Map type
- Definition: key-value pairs, similar to Python's dict
- Schema: `MAP<STRING, STRING>`
- Inserting: `MAP('0', 'hung', '1', 'linh')`
- Accessing: `M['0']`
- Techniques:
    - `LATERAL VIEW, EXPLODE, POSEXPLODE, INLINE,...`
    - `MAP_KEYS, MAP_VALUES, STR_TO_MAP`
    - `GET_JSON_OBJECT`