<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md) 

# Lab 4.5 : Data formats (JSON vs. Parquet vs. ORC)


### Overview
This lab compares data formats for Spark DataFrames.  We will evaluate JSON, Parquet, and ORC on storage size, load time, and query time.

Background reads:
- [Spark data frames](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- JSON format 
    - [wikipedia](https://en.wikipedia.org/wiki/JSON)
    - [json.org](http://json.org/)
- Parquet format
    - [Parquet project](https://parquet.apache.org/)
    - [parquet github](https://github.com/Parquet/parquet-format)
    - [presentation](http://www.slideshare.net/larsgeorge/parquet-data-io-philadelphia-2013)
- ORC format
    - [ORC project](https://orc.apache.org/)
    - [ORC explained](http://www.semantikoz.com/blog/orc-intelligent-big-data-file-format-hadoop-hive/)
    - [ORC performance](http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_performance_tuning/content/hive_perf_best_pract_use_orc_file_format.html)

### Depends On 
None

### Run time
20-30 mins


## STEP 1: Clickstream data
More than 1 GB of clickstream data is stored in the `/data/click-stream/json` directory.  

The records look like this:

```json
{"timestamp": 1420070400000, "ip": "ip_557", "user": "user_13011", "action": "blocked", "domain": "npr.org", "campaign": "campaign_13", "cost": 116, "session": "session_43"}

{"timestamp": 1420070400043, "ip": "ip_129", "user": "user_58773", "action": "clicked", "domain": "flickr.com", "campaign": "campaign_7", "cost": 170, "session": "session_23"}

{"timestamp": 1420070400086, "ip": "ip_704", "user": "user_71191", "action": "viewed", "domain": "foxnews.com", "campaign": "campaign_20", "cost": 47, "session": "session_48"}

```
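To get a feel for the record layout, one of the sample lines above can be parsed with Python's standard `json` module. This is only an illustrative sketch: the variable names are made up here, and reading `timestamp` as milliseconds since the Unix epoch is an assumption based on its magnitude.

```python
import json
from datetime import datetime, timezone

# One record copied from the sample above
line = (
    '{"timestamp": 1420070400000, "ip": "ip_557", "user": "user_13011", '
    '"action": "blocked", "domain": "npr.org", "campaign": "campaign_13", '
    '"cost": 116, "session": "session_43"}'
)

record = json.loads(line)

# Assumption: timestamp is milliseconds since the Unix epoch
event_time = datetime.fromtimestamp(record["timestamp"] / 1000, tz=timezone.utc)

print(sorted(record.keys()))
print(event_time.isoformat(), record["action"], record["cost"])
```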

#### [Optional] If you need to generate more data....
```bash
    $    cd   /data/click-stream/
    $    python   gen-clickstream-json.py
```

## STEP 2: Benchmarking Spreadsheet
Download and inspect [Benchmarking_Dataformats.xlsx](Benchmarking_Dataformats.xlsx).  
**We will fill in the values in this spreadsheet as we execute commands in the Spark shell.**

It will look like this (click on the image for a larger version)

<a href="../assets/images/5.3a.png"><img src="../assets/images/5.3a-small.png" style="border: 5px solid grey; max-width:100%;"/></a>



In [1]:
import time

print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

Spark UI running on http://YOURIPADDRESS:4040


In [None]:
sc.setLogLevel("INFO")
print("log level set to INFO")

## STEP 3: ATOP

Open another terminal and run **atop**.  
We will use it to monitor CPU and IO usage.


## STEP 4: Load Clickstream data

In [None]:
import time

# load all the files in the dir
t1 = time.perf_counter()

clicksJson = spark.read.json("/data/click-stream/json/")

t2 = time.perf_counter()
print ("Read JSON in {:,.2f} ms ".format( (t2-t1)*1000))
print(clicksJson)

**==> While the import is running take a look at `atop` terminal.  Which of the resources are we maxing out?**  
**==> Measure the time taken to load JSON data; record it in the spreadsheet**  
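The `t1` / `t2` timing pattern is repeated in most cells of this lab. As an optional convenience, it could be wrapped in a small context manager. This is a sketch, not part of the lab: the names `timed` and `timings` are hypothetical, and `time.sleep` stands in for a Spark call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print the wall-clock time (ms) of the enclosed block; optionally store it in results."""
    t1 = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - t1) * 1000
        if results is not None:
            results[label] = elapsed_ms
        print("{} in {:,.2f} ms".format(label, elapsed_ms))

# Collect the measurements in one dict so they are easy to copy into the spreadsheet
timings = {}
with timed("demo sleep", timings):
    time.sleep(0.05)  # stand-in for a Spark call
```

In a lab cell, the body of the `with` block would be the Spark statement being measured, for example `clicksJson = spark.read.json("/data/click-stream/json/")`.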

## STEP 5: Query JSON data

**==> Find the max value of cost**   
**==> While the query is running, check `atop`**

In [None]:
import time

clicksJson.createOrReplaceTempView("clicks_json")

t1 = time.perf_counter()
spark.sql("SELECT MAX(cost) FROM clicks_json").show()
t2 = time.perf_counter()

print ("MAX in JSON in {:,.2f} ms ".format( (t2-t1)*1000))

Sample output
```
    +---------+
    |MAX(cost)|
    +---------+
    |      180|
    +---------+
```

**==> Note the time it took to run the query, and record it in spreadsheet**
```
Job 1 finished: show at <console>:24, took 8.550481 s
```

## STEP 6 : Save the logs in Parquet format

We will use Spark's built-in Parquet support to save the DataFrame in Parquet format.

In [None]:
import time
t1 = time.perf_counter()

clicksJson.write.parquet("/data/click-stream/my-parquet")

t2 = time.perf_counter()
print ("Wrote Parquet in {:,.2f} ms ".format( (t2-t1)*1000))

**==> Inspect `atop` terminal**  
**==> Measure the time taken to 'save as parquet' and record it in spreadsheet**  

## STEP 7 : Saving ORC

In [None]:
import time
t1 = time.perf_counter()

clicksJson.write.orc("/data/click-stream/my-orc")

t2 = time.perf_counter()
print ("Wrote ORC in {:,.2f} ms ".format( (t2-t1)*1000))

**==> Measure the time taken to save as ORC and record in spreadsheet**   

## STEP 8 : Querying Parquet Data

In [None]:
import time
t1 = time.perf_counter()

clicksParquet = spark.read.parquet("/data/click-stream/my-parquet")

t2 = time.perf_counter()
print ("Read Parquet in {:,.2f} ms ".format( (t2-t1)*1000))

clicksParquet.createOrReplaceTempView("clicks_parquet")

**==> Note how quickly the data is loaded; measure this time and record it in the spreadsheet**   
**==> Note that the schema is available immediately, without scanning the data!**  

Parquet files store their schema in the file metadata, so Spark does not have to scan and parse the data to infer it, as it must for JSON.
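The difference between a row-oriented text format and a columnar format with a stored schema can be illustrated without Spark. The sketch below is a toy model only, not how Parquet is actually implemented: the `columnar` dictionary and its layout are invented for the illustration, and the records are shortened versions of the STEP 1 samples.

```python
import json

# Three records in the shape of the clickstream sample (values taken from STEP 1)
json_lines = [
    '{"timestamp": 1420070400000, "user": "user_13011", "action": "blocked", "cost": 116}',
    '{"timestamp": 1420070400043, "user": "user_58773", "action": "clicked", "cost": 170}',
    '{"timestamp": 1420070400086, "user": "user_71191", "action": "viewed", "cost": 47}',
]

# Row-oriented text: every record is parsed in full just to reach one field
max_cost_rows = max(json.loads(line)["cost"] for line in json_lines)

# Toy columnar layout: a stored schema plus one list of values per column
parsed = [json.loads(line) for line in json_lines]
columnar = {
    "schema": {"timestamp": "long", "user": "string", "action": "string", "cost": "long"},
    "columns": {name: [row[name] for row in parsed]
                for name in ("timestamp", "user", "action", "cost")},
}

# Only the "cost" column is touched; the other columns are never read
max_cost_columnar = max(columnar["columns"]["cost"])

print(max_cost_rows, max_cost_columnar)
```

Both approaches give the same answer. The point of the sketch is that the columnar version already knows the column types and reads a single column.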

**==> Calculate max(cost)**

In [None]:
import time
t1 = time.perf_counter()

spark.sql("SELECT MAX(cost) FROM clicks_parquet").show()

t2 = time.perf_counter()
print ("MAX Parquet in {:,.2f} ms ".format( (t2-t1)*1000))

**==> Note the time taken and record it in the spreadsheet**    
Sample output

`Job 3 finished: show at <console>:24, took 0.627185 s`

**==> Why is Parquet so quick to process?** 


## STEP 9 : Querying ORC

In [None]:
import time
t1 = time.perf_counter()

clicksORC = spark.read.orc("/data/click-stream/my-orc")

t2 = time.perf_counter()
print ("Read ORC in {:,.2f} ms ".format( (t2-t1)*1000))

clicksORC.createOrReplaceTempView("clicks_orc")

**==> Note the load time and record in spreadsheet**   

**==> Measure query time and record in spreadsheet**

In [None]:
import time
t1 = time.perf_counter()

spark.sql("SELECT MAX(cost) FROM clicks_orc").show()

t2 = time.perf_counter()
print ("MAX ORC in {:,.2f} ms ".format( (t2-t1)*1000))

## STEP 10 : Compare Data Sizes
Open a terminal and run the following command.

```bash
    # bytes for spreadsheet
    $    du -b  ~/data/click-stream/*
    # on Mac, `du -b` is not available; use `du -k` (sizes in KB)

    # for human readable format use
    $    du -skh  ~/data/click-stream/*
```

Sample output

```
    1415178847  /Users/sujee/data/click-stream/json
    161398938   /Users/sujee/data/click-stream/json-gz
    105793926   /Users/sujee/data/click-stream/orc
    118394196   /Users/sujee/data/click-stream/parquet
```
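To compare the formats at a glance, the byte counts can be turned into ratios relative to plain JSON. The snippet below is a small sketch that uses the sample numbers above; substitute the values from your own run.

```python
# Byte counts copied from the sample `du -b` output above
sizes = {
    "json": 1415178847,
    "json-gz": 161398938,
    "orc": 105793926,
    "parquet": 118394196,
}

baseline = sizes["json"]

# How many times smaller each format is compared with plain JSON
ratios = {name: baseline / size for name, size in sizes.items()}

for name, size in sizes.items():
    print("{:8s} {:>8.1f} MiB   {:5.1f}x smaller than JSON".format(
        name, size / (1024 * 1024), ratios[name]))
```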

**==> Record the byte sizes in spreadsheet**  

## BONUS : Compressed JSON

We will store the JSON files in gzip-compressed form.

**==> Compress the files**

```bash
$    cd   ~/data/click-stream
$   ./compress-json.sh
```

This will create compressed JSON in the `json-gz` directory.
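The contents of `compress-json.sh` are not shown in this lab. As a rough Python equivalent of gzip-compressing a JSON-lines file, a sketch could look like the following. It assumes each file is compressed individually; the helper name `gzip_file` and the temporary demo file are made up for the illustration.

```python
import gzip
import os
import shutil
import tempfile

def gzip_file(src_path, dst_path):
    """Write a gzip-compressed copy of src_path to dst_path; return both sizes in bytes."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(src_path), os.path.getsize(dst_path)

# Demo on a throwaway file of repetitive clickstream-like records
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "clicks.json")
with open(src, "w") as f:
    for i in range(10000):
        f.write('{"timestamp": %d, "action": "clicked", "domain": "npr.org", "cost": %d}\n'
                % (1420070400000 + i, i % 180))

raw_bytes, gz_bytes = gzip_file(src, src + ".gz")
print("raw: {:,} bytes, gzip: {:,} bytes".format(raw_bytes, gz_bytes))

# Round-trip check: the decompressed content matches the original
with gzip.open(src + ".gz", "rb") as gz, open(src, "rb") as raw:
    round_trip_ok = gz.read() == raw.read()

shutil.rmtree(work_dir)
```

Spark reads `.gz` files transparently, but a gzip file cannot be split across tasks, so each compressed file is read by a single task.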

**==> Inspect directory sizes**

```bash
    # bytes for spreadsheet
    $    du -b json    json-gz   parquet 

    # human readable format
    $    du -skh  json    json-gz   parquet 
```

Sample output

```
1.3G    json
154M    json-gz
 77M    parquet
```

**==> Load compressed json files in Spark shell and do the same processing**  
**==> Look at `atop` window to see resource usage**

In [None]:
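import time

# note the parsing time
t1 = time.perf_counter()
clicksJgz = spark.read.json("/data/click-stream/json-gz")
t2 = time.perf_counter()
print("Read compressed JSON in {:,.2f} ms".format((t2 - t1) * 1000))

# register the compressed-JSON DataFrame (not clicksParquet) as the view
clicksJgz.createOrReplaceTempView("clicks_jsongz")

# calculate the max cost; note the time taken
t1 = time.perf_counter()
spark.sql("SELECT MAX(cost) FROM clicks_jsongz").show()
t2 = time.perf_counter()
print("MAX compressed JSON in {:,.2f} ms".format((t2 - t1) * 1000))

# output : Job 7 finished: show at console:22, took 8.066727 s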


## STEP 11 : Analyze / discuss results

Here are numbers from my run:

| format  | storage size | loading time | query time for max(cost) |
|---------|:-------------|:-------------|:------------------------:|
| json    | 1.3 G        | 8.3 s        | 4.6 s                    |
| json.gz | 154 M        | 8.5 s        | 4.1 s                    |
| parquet | 101 M        | 0 s          | 0.23 s                   |
| ORC     | 113 M        | 0 s          | 0.76 s                   |
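One way to summarize the query column is as a speedup relative to plain JSON. The following sketch uses the query times from the table above; replace them with the numbers from your own spreadsheet.

```python
# Query times for max(cost), in seconds, copied from the table above
query_seconds = {"json": 4.6, "json.gz": 4.1, "parquet": 0.23, "ORC": 0.76}

# Speedup of each format relative to querying plain JSON
speedup_vs_json = {
    fmt: query_seconds["json"] / secs for fmt, secs in query_seconds.items()
}

for fmt, factor in speedup_vs_json.items():
    print("{:8s} {:5.1f}x".format(fmt, factor))
```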

**==> Also discuss your findings from `atop`.  Which resource 'ceiling' are we hitting first: CPU, memory, or disk?**