## Loading into Partitions

Let us understand how to use the `LOAD` command to load data into partitioned tables.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = itv002480


itv002480

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - DML and Partitioning").
    master("yarn").
    getOrCreate

username = itv002480
spark = org.apache.spark.sql.SparkSession@3786e5cd


org.apache.spark.sql.SparkSession@3786e5cd

If you are going to use CLIs, you can use Spark SQL with one of the following 3 approaches.

**Using Spark SQL**

```shell
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```shell
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```shell
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* The file format of the files being loaded must match the file format specified when the table was created.
* For the text file format, the delimiters in the files must also match the delimiters defined for the table.
* The data in each file must satisfy the criteria of the partition into which it is loaded.
* Our `/data/retail_db/orders` data set has data for the whole year, so we should not load it directly into a single partition.
* We need to split it into files matching the partition criteria and then load each file into its partition.

To use the `LOAD` command to load files into partitions, we need to pre-partition the data based on the partition logic.

Here is an example of using simple shell commands to partition the data. Run these commands from the command prompt.


```shell
rm -rf ~/orders
mkdir -p ~/orders

grep 2013-07 /data/retail_db/orders/part-00000 > ~/orders/orders_201307
grep 2013-08 /data/retail_db/orders/part-00000 > ~/orders/orders_201308
grep 2013-09 /data/retail_db/orders/part-00000 > ~/orders/orders_201309
grep 2013-10 /data/retail_db/orders/part-00000 > ~/orders/orders_201310
```
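The `grep` commands above match the month string anywhere in a line. As a minimal sketch, the same split can be keyed on the `order_date` field only, which is assumed here to be the second comma-separated field as in the sample rows shown later in this section. The function name `split_by_month`, the temporary directory and the three sample rows are illustrative and not part of the original material.

```shell
# Split a comma-delimited orders file into one file per month.
# The month is derived from the 2nd field (order_date), e.g. 2013-07-25 -> 201307.
# Files are opened in append mode, so the target directory should start out empty.
split_by_month() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    awk -F',' -v dest="$dest" '{
        month = substr($2, 1, 4) substr($2, 6, 2)
        out = dest "/orders_" month
        print >> out
        close(out)
    }' "$src"
}

# Demo on a tiny illustrative sample instead of /data/retail_db/orders/part-00000
demo_dir=$(mktemp -d)
cat > "$demo_dir/part-00000" <<'EOF'
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-08-01 00:00:00.0,12111,COMPLETE
EOF

split_by_month "$demo_dir/part-00000" "$demo_dir/orders"
ls "$demo_dir/orders"
```

On the real data set this would be called as `split_by_month /data/retail_db/orders/part-00000 ~/orders`, after clearing `~/orders` as in the commands above. Unlike the `grep` version, it produces a file for every month present in the source, not only the four months used here.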

Let us see how we can load the data into the corresponding partitions. The data has to be pre-partitioned based on the partition column.
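Because the load copies each file into its partition directory as is, it can be worth checking a file before loading it. The following is a rough sketch of such a check, assuming four comma-separated fields per row with `order_date` as the second field. The function name `check_partition_file` and the sample files are hypothetical.

```shell
# Succeed (exit status 0) only if every row has 4 comma-separated fields
# and its order_date (2nd field) falls in the expected month, e.g. 201307.
check_partition_file() {
    file="$1"
    month="$2"
    awk -F',' -v month="$month" '
        NF != 4 { bad++; next }
        (substr($2, 1, 4) substr($2, 6, 2)) != month { bad++ }
        END { exit (bad > 0) }
    ' "$file"
}

# Demo with two small illustrative files
check_dir=$(mktemp -d)
printf '%s\n' '1,2013-07-25 00:00:00.0,11599,CLOSED' > "$check_dir/good"
printf '%s\n' '1,2013-07-25 00:00:00.0,11599,CLOSED' \
              '3,2013-08-01 00:00:00.0,12111,COMPLETE' > "$check_dir/mixed"

for name in good mixed; do
    if check_partition_file "$check_dir/$name" 201307; then
        echo "$name: all rows belong to 201307"
    else
        echo "$name: some rows do not belong to 201307"
    fi
done
```

A call such as `check_partition_file ~/orders/orders_201307 201307` could be run against each file before the corresponding load.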

In [9]:
%%sql

USE itv002480_retail

++
||
++
++



In [11]:
%%sql

LOAD DATA LOCAL INPATH '/home/itv002480/orders/orders_201307'
  INTO TABLE orders_part PARTITION (order_month=201307)

++
||
++
++



In [12]:
import sys.process._

s"hdfs dfs -ls -R /user/${username}/warehouse/${username}_retail.db/orders_part" !

drwxr-xr-x   - itv002480 supergroup          0 2022-05-30 13:00 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201307
-rwxr-xr-x   3 itv002480 supergroup      64737 2022-05-30 13:00 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201307/orders_201307
drwxr-xr-x   - itv002480 supergroup          0 2022-05-30 12:45 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201308
drwxr-xr-x   - itv002480 supergroup          0 2022-05-30 12:45 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201309
drwxr-xr-x   - itv002480 supergroup          0 2022-05-30 12:45 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201310




0

In [13]:
%%sql

LOAD DATA LOCAL INPATH '/home/itv002480/orders/orders_201308'
  INTO TABLE orders_part PARTITION (order_month=201308)

++
||
++
++



In [14]:
%%sql

LOAD DATA LOCAL INPATH '/home/itv002480/orders/orders_201309'
  INTO TABLE orders_part PARTITION (order_month=201309)

++
||
++
++



In [15]:
%%sql

LOAD DATA LOCAL INPATH '/home/itv002480/orders/orders_201310'
  INTO TABLE orders_part PARTITION (order_month=201310)

++
||
++
++
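The four load statements differ only in the month. As a small sketch, they can be generated with a shell loop and then pasted into `%%sql` cells or a Spark SQL session. The variable names `base_dir` and `load_sql` are illustrative, and `base_dir` assumes the files were written to `~/orders` as in the earlier shell commands.

```shell
# Generate one LOAD DATA statement per pre-partitioned file.
# base_dir is an assumption: the local directory holding orders_<yyyymm> files.
base_dir="${HOME:-/home/itv002480}/orders"

load_sql=$(
    for month in 201307 201308 201309 201310; do
        printf "LOAD DATA LOCAL INPATH '%s/orders_%s'\n" "$base_dir" "$month"
        printf "  INTO TABLE orders_part PARTITION (order_month=%s);\n" "$month"
    done
)

# Print the statements for review before running them
printf '%s\n' "$load_sql"
```

This only prints the statements and does not run them, which leaves room to review the paths and partition values first.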



In [16]:
import sys.process._

s"hdfs dfs -ls -R /user/${username}/warehouse/${username}_retail.db/orders_part" !

drwxr-xr-x   - itv002480 supergroup          0 2022-05-30 13:00 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201307
-rwxr-xr-x   3 itv002480 supergroup      64737 2022-05-30 13:00 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201307/orders_201307
drwxr-xr-x   - itv002480 supergroup          0 2022-05-30 13:00 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201308
-rwxr-xr-x   3 itv002480 supergroup     243190 2022-05-30 13:00 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201308/orders_201308
drwxr-xr-x   - itv002480 supergroup          0 2022-05-30 13:01 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201309
-rwxr-xr-x   3 itv002480 supergroup     251262 2022-05-30 13:01 /user/itv002480/warehouse/itv002480_retail.db/orders_part/order_month=201309/orders_201309
drwxr-xr-x   - itv002480 supergroup          0 2022-05-30 13:01 /user/itv002480/warehouse/itv002480_retail.db/or



0

In [17]:
import sys.process._

s"hdfs dfs -tail /user/${username}/warehouse/${username}_retail.db/orders_part/order_month=201310/orders_201310"!

ETE
67755,2013-10-30 00:00:00.0,8386,PENDING_PAYMENT
67756,2013-10-31 00:00:00.0,6146,PROCESSING
67757,2013-10-31 00:00:00.0,441,CLOSED
67758,2013-10-31 00:00:00.0,6750,CLOSED
67759,2013-10-31 00:00:00.0,10759,PENDING_PAYMENT
68725,2013-10-01 00:00:00.0,7795,PROCESSING
68726,2013-10-02 00:00:00.0,8817,PENDING_PAYMENT
68727,2013-10-05 00:00:00.0,5880,PENDING_PAYMENT
68728,2013-10-07 00:00:00.0,7267,COMPLETE
68729,2013-10-09 00:00:00.0,8043,PENDING_PAYMENT
68730,2013-10-11 00:00:00.0,3568,PENDING_PAYMENT
68731,2013-10-13 00:00:00.0,8102,PENDING_PAYMENT
68732,2013-10-14 00:00:00.0,9990,COMPLETE
68733,2013-10-16 00:00:00.0,12429,CLOSED
68734,2013-10-18 00:00:00.0,8510,ON_HOLD
68735,2013-10-21 00:00:00.0,788,COMPLETE
68736,2013-10-23 00:00:00.0,8462,COMPLETE
68737,2013-10-26 00:00:00.0,10302,PROCESSING
68738,2013-10-27 00:00:00.0,1100,COMPLETE
68739,2013-10-28 00:00:00.0,2528,PENDING
68740,2013-10-29 00:00:00.0,10691,ON_HOLD
68741,2013-10-30 00:00:00.0,5974,PENDING_PAYMENT
68742,2013-10-31 



0

In [18]:
%%sql

SELECT * FROM orders_part LIMIT 10



+--------+--------------------+-----------------+---------------+-----------+
|order_id|          order_date|order_customer_id|   order_status|order_month|
+--------+--------------------+-----------------+---------------+-----------+
|    4148|2013-08-18 00:00:...|             4941|     PROCESSING|     201308|
|    4149|2013-08-18 00:00:...|            11598|     PROCESSING|     201308|
|    4150|2013-08-18 00:00:...|            10602|     PROCESSING|     201308|
|    4151|2013-08-18 00:00:...|             1108|PENDING_PAYMENT|     201308|
|    4152|2013-08-18 00:00:...|              675|       COMPLETE|     201308|
|    4153|2013-08-18 00:00:...|             2764|PENDING_PAYMENT|     201308|
|    4154|2013-08-18 00:00:...|             8822|PENDING_PAYMENT|     201308|
|    4155|2013-08-18 00:00:...|            12004|         CLOSED|     201308|
|    4156|2013-08-18 00:00:...|              733|     PROCESSING|     201308|
|    4157|2013-08-18 00:00:...|              504|         CLOSED