## Preparing Tables

Let us prepare the tables to solve the problem.

* Make sure the database is created.
* Create the **orders** table.
* Load data from the local path **/data/retail_db/orders** into the newly created **orders** table.
* Preview data and get the count from **orders**.
* Create the **order_items** table.
* Load data from the local path **/data/retail_db/order_items** into the newly created **order_items** table.
* Preview data and get the count from **order_items**.

Once the tables and data are ready, we can start writing queries against them to perform basic transformations.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our [10-node state-of-the-art cluster/labs](https://labs.spark.com/plans) to learn Spark SQL using our integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = spark


spark

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Basic Transformations").
    master("yarn").
    getOrCreate

username = spark
spark = org.apache.spark.sql.SparkSession@4c4dbcd1


org.apache.spark.sql.SparkSession@4c4dbcd1

If you prefer a CLI, you can launch Spark SQL using one of the following three approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

In [3]:
%%sql

DROP DATABASE IF EXISTS spark_retail CASCADE

In [4]:
%%sql

CREATE DATABASE IF NOT EXISTS spark_retail

++
||
++
++



In [5]:
%%sql
USE spark_retail

++
||
++
++



In [6]:
%%sql
SHOW tables

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+



In [7]:
%%sql

DROP TABLE IF EXISTS orders

In [8]:
%%sql

CREATE TABLE orders (
    order_id INT,
    order_date STRING,
    order_customer_id INT,
    order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

++
||
++
++
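
To illustrate what `FIELDS TERMINATED BY ','` implies for each line of the data file, here is a minimal sketch that splits one comma-delimited line into the four columns declared above. The `Order` case class, the `parseOrder` helper, and the sample line (including its time portion) are assumptions made for illustration; they are not part of the notebook or of the dataset.

```scala
// Hypothetical model of one row of the orders table (names are illustrative).
final case class Order(
  orderId: Int,
  orderDate: String,
  orderCustomerId: Int,
  orderStatus: String
)

// Split a comma-delimited line into the four declared columns.
def parseOrder(line: String): Order = {
  // A limit of -1 keeps trailing empty fields instead of dropping them.
  val fields = line.split(",", -1)
  require(fields.length == 4, s"expected 4 fields, got ${fields.length}")
  Order(fields(0).toInt, fields(1), fields(2).toInt, fields(3))
}

// Assumed sample line in the layout the table expects.
val sampleLine = "1,2013-07-25 00:00:00.0,11599,CLOSED"
val sampleOrder = parseOrder(sampleLine)
println(sampleOrder)
```

Note that `LOAD DATA` itself only places the files under the table directory; unlike this sketch, it does not parse or validate the lines at load time.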



In [9]:
import sys.process._
val username = System.getProperty("user.name")
// List the table directory under the warehouse configured for this session
s"hdfs dfs -ls /user/${username}/warehouse/spark_retail.db/orders".!

username = spark




0

In [10]:
%%sql

LOAD DATA LOCAL INPATH '/data/retail_db/orders' INTO TABLE orders

++
||
++
++



In [11]:
import sys.process._
val username = System.getProperty("user.name")
// The loaded file should now appear in the table directory
s"hdfs dfs -ls /user/${username}/warehouse/spark_retail.db/orders".!

Found 1 items
-rwxr-xr-x   1 spark supergroup    2999944 2022-02-02 04:07 /user/spark/warehouse/spark_retail.db/orders/part-00000


username = spark




0
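
The directory listed above follows from the `spark.sql.warehouse.dir` setting used when the session was created: a managed table is stored under `<warehouse>/<database>.db/<table>`. The sketch below only rebuilds that path as a string; `managedTablePath` is a hypothetical helper, and the user name is hard-coded to the value the notebook reported.

```scala
// Hypothetical helper: derive the directory of a managed table
// from the warehouse directory, database name, and table name.
def managedTablePath(warehouseDir: String, database: String, table: String): String = {
  val base = warehouseDir.stripSuffix("/")
  s"$base/${database.toLowerCase}.db/${table.toLowerCase}"
}

val user = "spark" // value of user.name reported by the notebook
val warehouseDir = s"/user/$user/warehouse"
val ordersPath = managedTablePath(warehouseDir, "spark_retail", "orders")
println(ordersPath) // /user/spark/warehouse/spark_retail.db/orders
```

This matches the location of `part-00000` in the listing, which is a quick way to confirm that the table was created in the expected database.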

In [12]:
%%sql

SELECT * FROM orders LIMIT 10

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
+--------+--------------------+-----------------+---------------+



In [13]:
%%sql

SELECT count(1) FROM orders

+--------+
|count(1)|
+--------+
|   68883|
+--------+



In [14]:
%%sql

DROP TABLE IF EXISTS order_items

In [15]:
%%sql 

CREATE TABLE order_items (
    order_item_id INT,
    order_item_order_id INT,
    order_item_product_id INT,
    order_item_quantity INT,
    order_item_subtotal FLOAT,
    order_item_product_price FLOAT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

++
||
++
++



In [16]:
import sys.process._
val username = System.getProperty("user.name")
// List the table directory under the warehouse configured for this session
s"hdfs dfs -ls /user/${username}/warehouse/spark_retail.db/order_items".!

username = spark




0

In [17]:
%%sql

LOAD DATA LOCAL INPATH '/data/retail_db/order_items' INTO TABLE order_items

++
||
++
++



In [18]:
import sys.process._
val username = System.getProperty("user.name")
// The loaded file should now appear in the table directory
s"hdfs dfs -ls /user/${username}/warehouse/spark_retail.db/order_items".!

Found 1 items
-rwxr-xr-x   1 spark supergroup    5408880 2022-02-02 03:20 /user/spark/warehouse/spark_retail.db/order_items/part-00000


username = spark




0

In [19]:
%%sql

SELECT * FROM order_items LIMIT 10

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            5|                  4|                  897|                  2|              49.98|                   24.99|
|            6| 

In [20]:
%%sql

SELECT count(1) FROM order_items

+--------+
|count(1)|
+--------+
|  172198|
+--------+
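
Beyond the row count, the preview can serve as a quick sanity check. In the five complete rows shown, `order_item_subtotal` appears to equal `order_item_quantity` times `order_item_product_price`. This is an observation from those rows, not a documented constraint. The sketch below checks it on those five rows only; `OrderItem` and `isConsistent` are hypothetical names, and the values are copied from the preview output.

```scala
// Hypothetical model of one row of the order_items table.
final case class OrderItem(
  id: Int,
  orderId: Int,
  productId: Int,
  quantity: Int,
  subtotal: Double,
  productPrice: Double
)

// The five complete rows visible in the preview output.
val previewRows = Seq(
  OrderItem(1, 1, 957, 1, 299.98, 299.98),
  OrderItem(2, 2, 1073, 1, 199.99, 199.99),
  OrderItem(3, 2, 502, 5, 250.0, 50.0),
  OrderItem(4, 2, 403, 1, 129.99, 129.99),
  OrderItem(5, 4, 897, 2, 49.98, 24.99)
)

// Compare with a small tolerance, since the columns are floating point.
def isConsistent(item: OrderItem, tolerance: Double = 0.01): Boolean =
  math.abs(item.subtotal - item.quantity * item.productPrice) <= tolerance

val inconsistentRows = previewRows.filterNot(item => isConsistent(item))
println(s"inconsistent rows: ${inconsistentRows.size}")
```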



We can also run the same steps using `spark.sql` from Scala or Python. The cells below use Scala.

In [21]:
spark.sql("DROP DATABASE IF EXISTS spark_retail CASCADE")

[]

In [22]:
spark.sql("CREATE DATABASE IF NOT EXISTS spark_retail")

[]

In [23]:
spark.sql("USE spark_retail")

[]

In [24]:
spark.sql("SHOW tables")

[database: string, tableName: string ... 1 more field]

In [25]:
spark.sql("DROP TABLE IF EXISTS orders")

In [26]:
spark.sql("""
CREATE TABLE orders (
    order_id INT,
    order_date STRING,
    order_customer_id INT,
    order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")

[]

In [27]:
import sys.process._
val username = System.getProperty("user.name")
// List the table directory under the warehouse configured for this session
s"hdfs dfs -ls /user/${username}/warehouse/spark_retail.db/orders".!

username = spark




0

In [28]:
spark.sql("LOAD DATA LOCAL INPATH '/data/retail_db/orders' INTO TABLE orders")

[]

In [29]:
import sys.process._
val username = System.getProperty("user.name")
// The loaded file should now appear in the table directory
s"hdfs dfs -ls /user/${username}/warehouse/spark_retail.db/orders".!

Found 1 items
-rwxr-xr-x   1 spark supergroup    2999944 2022-02-02 03:20 /user/spark/warehouse/spark_retail.db/orders/part-00000


username = spark




0

In [30]:
spark.sql("SELECT * FROM orders LIMIT 10").show()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   34565|2014-02-23 00:00:...|             8702|       COMPLETE|
|   34566|2014-02-23 00:00:...|             3066|PENDING_PAYMENT|
|   34567|2014-02-23 00:00:...|             7314|SUSPECTED_FRAUD|
|   34568|2014-02-23 00:00:...|             1271|       COMPLETE|
|   34569|2014-02-23 00:00:...|            11083|       COMPLETE|
|   34570|2014-02-23 00:00:...|             3159|         CLOSED|
|   34571|2014-02-23 00:00:...|             4551|         CLOSED|
|   34572|2014-02-23 00:00:...|             8135|        PENDING|
|   34573|2014-02-23 00:00:...|             7497|PENDING_PAYMENT|
|   34574|2014-02-23 00:00:...|             1868|        ON_HOLD|
+--------+--------------------+-----------------+---------------+



In [31]:
spark.sql("SELECT count(1) FROM orders").show()

+--------+
|count(1)|
+--------+
|   68883|
+--------+



In [32]:
spark.sql("DROP TABLE IF EXISTS order_items")

In [33]:
spark.sql("""
CREATE TABLE order_items (
    order_item_id INT,
    order_item_order_id INT,
    order_item_product_id INT,
    order_item_quantity INT,
    order_item_subtotal FLOAT,
    order_item_product_price FLOAT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")

[]

In [34]:
import sys.process._
val username = System.getProperty("user.name")
// List the table directory under the warehouse configured for this session
s"hdfs dfs -ls /user/${username}/warehouse/spark_retail.db/order_items".!

username = spark




0

In [35]:
spark.sql("LOAD DATA LOCAL INPATH '/data/retail_db/order_items' INTO TABLE order_items")

[]

In [36]:
import sys.process._
val username = System.getProperty("user.name")
// The loaded file should now appear in the table directory
s"hdfs dfs -ls /user/${username}/warehouse/spark_retail.db/order_items".!

Found 1 items
-rwxr-xr-x   1 spark supergroup    5408880 2022-02-02 03:21 /user/spark/warehouse/spark_retail.db/order_items/part-00000


username = spark




0

In [37]:
spark.sql("SELECT * FROM order_items LIMIT 10").show()

+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            5|                  4|                  897|                  2|              49.98|                   24.99|
|            6| 

In [38]:
spark.sql("SELECT count(1) FROM order_items").show()

+--------+
|count(1)|
+--------+
|  172198|
+--------+
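
Both tables go through the same drop, create, and load sequence, so the statements can be generated from a small helper. The following is a minimal sketch, not part of the original notebook: `prepareStatements` is a hypothetical function that only builds the SQL strings, and it assumes comma-delimited text files as used above.

```scala
// Hypothetical helper: build the drop/create/load statements for one table.
def prepareStatements(table: String, columnsDdl: String, localPath: String): Seq[String] = Seq(
  // IF EXISTS avoids an error when the table has not been created yet.
  s"DROP TABLE IF EXISTS $table",
  s"CREATE TABLE $table ($columnsDdl) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','",
  s"LOAD DATA LOCAL INPATH '$localPath' INTO TABLE $table"
)

val ordersStatements = prepareStatements(
  "orders",
  "order_id INT, order_date STRING, order_customer_id INT, order_status STRING",
  "/data/retail_db/orders"
)

// Print the statements for review before running them.
ordersStatements.foreach(println)

// With an active session, they could then be executed in order:
// ordersStatements.foreach(stmt => spark.sql(stmt))
```

Keeping statement generation separate from execution makes it easy to inspect the SQL first and to reuse the helper for **order_items** with its own column list and path.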

