# ST446 Distributed Computing for Big Data
## Homework 1: Spark RDDs, Spark SQL and Hive
### Milan Vojnovic and Christine Yuen, LT 2018

---

## Instructions:

**Deadline**: February 20, 2018, 5pm London time

**Datasets**: All the datasets are available for download from here:

https://www.dropbox.com/sh/89xbpcjl4oq0j4w/AACrbtUzm3oCW1OcpL7BasRfa?dl=0


## A. Spark RDDs (30 points)

1. We continue to analyse the dblp dataset available in the file `large_author.txt`. This time we want to find the top 10 author pairs who jointly published the largest number of papers (possibly with other collaborators). For example, if authors _a_, _b_ and _c_ published a paper with title _t_, then this contributes one joint publication for each author pair (_a_,_b_), (_b_,_c_) and (_a_,_c_). Use the first column of the input data for the author names and use the third column of the input data for the publication title. You need to solve this task by using RDD operations like those in _rdd.ipynb_ in week 3 of the course and the [Spark RDD documentation](http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.RDD). You need to run your code on your laptop.

2. Run the same code as in the previous item but on Google Cloud Platform. Provide a copy of the terminal commands that you used, as well as screenshots if you ran this through a web user interface.

In [2]:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('1A') \
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.set("spark.kryoserializer.buffer.max", "128m") \
.set("spark.kryoserializer.buffer", "64m") 

sc = SparkContext.getOrCreate(conf=conf)

# Local file URI: the scheme must be written as "file:///"
textrdd = sc.textFile("file:///C:/hduser/author-large.txt")

rdd1 = textrdd.map(lambda x: x.split("\t", 1)) \
            .map(lambda x: (x[1],x[0]))

# Second reference to the same RDD (an alias, not a copy), used for the self-join
rdd1_copy = rdd1

# Do a join on both rdds to make pairs
rddop = rdd1.join(rdd1_copy)

# Outputs a list which shows top 20 pairs according to their counts, in descending order.
rddop_1 = rddop.map(lambda x: x[1]) \
               .filter(lambda x: x[0]!=x[1]) \
               .map(lambda x: (x,1)) \
               .reduceByKey(lambda x,y: x+y) \
               .takeOrdered(20, key = lambda x: -x[1])

# Define a function to remove mirror duplicates: a pair is kept only if its
# mirrored version (b, a) appears later in the list
def remove_mirror_dups(pairs):
    kept = []
    for p1 in range(len(pairs)):
        for p2 in range(p1 + 1, len(pairs)):
            if pairs[p1][0][0] == pairs[p2][0][1] and pairs[p1][0][1] == pairs[p2][0][0]:
                kept.append(pairs[p1])
    return kept

# Apply the function to the list and assign the output to a new list variable
rddop_2 = remove_mirror_dups(rddop_1)

# Outputs top 10 author pairs
rddop_2[:]

[(('Sudhakar M. Reddy', 'Irith Pomeranz'), 247),
 (('Divyakant Agrawal', 'Amr El Abbadi'), 161),
 (('Makoto Takizawa', 'Tomoya Enokido'), 138),
 (('Henri Prade', 'Didier Dubois'), 122),
 (('Tharam S. Dillon', 'Elizabeth Chang'), 116),
 (('Mary Jane Irwin', 'Narayanan Vijaykrishnan'), 107),
 (('Mahmut T. Kandemir', 'Mary Jane Irwin'), 100),
 (('Chun Chen', 'Jiajun Bu'), 99),
 (('Takahiro Hara', 'Shojiro Nishio'), 96),
 (('Maurizio Lenzerini', 'Giuseppe De Giacomo'), 91)]
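
The self-join above emits every co-author pair in both orders, which is why a mirror-removal step is needed and why 20 rows are fetched to obtain 10 pairs. One alternative, shown here only as a minimal pure-Python sketch and not as the method used above, is to emit each pair in a canonical (sorted) order so that mirrored duplicates never arise. The toy records and the helper name `count_author_pairs` are hypothetical; in Spark the same idea could presumably be expressed by grouping authors per title and generating the combinations per group.

```python
from collections import Counter
from itertools import combinations

def count_author_pairs(records):
    """Count joint publications per unordered author pair.

    records: iterable of (author, title) tuples (hypothetical toy input).
    """
    # Collect the set of authors for each title
    authors_by_title = {}
    for author, title in records:
        authors_by_title.setdefault(title, set()).add(author)

    pair_counts = Counter()
    for authors in authors_by_title.values():
        # sorted() fixes the order, so (a, b) and (b, a) share one key
        for pair in combinations(sorted(authors), 2):
            pair_counts[pair] += 1
    return pair_counts

# Made-up data: a, b, c wrote t1 together; a and b also wrote t2
records = [
    ("a", "t1"), ("b", "t1"), ("c", "t1"),
    ("a", "t2"), ("b", "t2"),
]
pair_counts = count_author_pairs(records)
print(pair_counts.most_common(1))
```

Because each pair has a single representation, the top 10 can be taken directly from the counts without a post-processing pass.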

## 2) Run the same code as in the previous item but on Google Cloud Platform

- I uploaded author-large.txt to my bucket.
- I then connected via SSH to my Cloud Dataproc cluster's master node to open a terminal window on the master instance.

![1](gcp_pyspark0.png)

I then ran the code line by line.

![1](gcp_pyspark1.png)

## B. Spark SQL (30 points)

Do the same as in problem A, except that now you need to use the Spark SQL API, which we covered in week 4 of the course. You may find it useful to consult 'Querying with Spark SQL' in _spark-dataframe-sql.ipynb_ of the week 4 class and the [Spark SQL documentation](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html).

In [3]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import Row
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession \
       .builder \
       .appName("Homework_1b") \
       .getOrCreate()


textrdd = sc.textFile("file:///C:/hduser/author-large.txt") \
            .map(lambda x: x.strip().split("\t",1)) \
            .map(lambda x: Row(author_name=x[0],book_name=x[1]))

# Convert the RDD to a DataFrame using the SparkSession created above
textdf = spark.createDataFrame(textrdd)

# Second reference to the same DataFrame (an alias, not a copy)
textdf_copy = textdf

# Change column name from 'author_name' to 'author_name2'
textdf_copy = textdf_copy.withColumnRenamed('author_name', 'author_name2')

# Do a full outer join on the dataframes to make author name pairs,
# remove all rows which have pairs with the same names.
textdf_pairs = textdf.join(textdf_copy, 'book_name', 'outer')
textdf_pairs = textdf_pairs.filter(textdf_pairs['author_name']!=textdf_pairs['author_name2'])

# Do a count on the author pairs, in descending order.
textdf_pairs_2 = textdf_pairs.groupby('author_name','author_name2').count() \
                             .orderBy('count',ascending=False)

# Setting window parameter
my_window = Window.partitionBy().orderBy(textdf_pairs_2['count'].desc())

# Create a new column named 'prev_row_name' that stores the previous row's 'author_name2' entry in the current row,
# using the row ordering defined by the window
textdf_pairs_3 = textdf_pairs_2.withColumn("prev_row_name", F.lag(textdf_pairs_2.author_name2).over(my_window))

# Create a new column named 'remove', it outputs 1 if entry of author_name is the same as prev_row_name, 0 otherwise.
# In other words, remove=1 means there exists a mirror duplicate of author name pairs.
textdf_pairs_4 = textdf_pairs_3.withColumn("remove", F.when(textdf_pairs_3.author_name == textdf_pairs_3.prev_row_name, 1) \
                                                          .otherwise(0))

# Remove all rows with 'remove'=1, and show only 3 columns.
textdf_pairs_5 = textdf_pairs_4.filter(textdf_pairs_4['remove']==0) \
                                .select('author_name', 'author_name2', 'count')

textdf_pairs_5.show(10, False) 

+-------------------+-----------------------+-----+
|author_name        |author_name2           |count|
+-------------------+-----------------------+-----+
|Irith Pomeranz     |Sudhakar M. Reddy      |247  |
|Amr El Abbadi      |Divyakant Agrawal      |161  |
|Tomoya Enokido     |Makoto Takizawa        |138  |
|Henri Prade        |Didier Dubois          |122  |
|Tharam S. Dillon   |Elizabeth Chang        |116  |
|Mary Jane Irwin    |Narayanan Vijaykrishnan|107  |
|Mahmut T. Kandemir |Mary Jane Irwin        |100  |
|Chun Chen          |Jiajun Bu              |99   |
|Takahiro Hara      |Shojiro Nishio         |96   |
|Giuseppe De Giacomo|Maurizio Lenzerini     |91   |
+-------------------+-----------------------+-----+
only showing top 10 rows
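
The `lag`-based step above removes a row only when its mirror sits on the immediately preceding row, which depends on how rows with equal counts happen to be ordered. A more direct option, sketched below under stated assumptions, is to keep only pairs whose first name sorts before the second, so each pair appears once. The sketch uses Python's built-in `sqlite3` purely so that it runs without a cluster; the table name `papers`, the aliases, and the toy rows are hypothetical, and the same query text has not been tested here against Spark SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (author_name TEXT, book_name TEXT)")
# Made-up data: a, b, c wrote t1 together; a and b also wrote t2
conn.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [("a", "t1"), ("b", "t1"), ("c", "t1"), ("a", "t2"), ("b", "t2")],
)

# Self-join on the title; the '<' condition drops self-pairs and
# keeps exactly one ordering of every author pair
query = """
    SELECT p1.author_name AS author_1,
           p2.author_name AS author_2,
           COUNT(*)       AS n_papers
    FROM papers AS p1
    JOIN papers AS p2
      ON p1.book_name = p2.book_name
     AND p1.author_name < p2.author_name
    GROUP BY p1.author_name, p2.author_name
    ORDER BY n_papers DESC, author_1, author_2
    LIMIT 10
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

The extra sort keys after `n_papers` only make the order of tied rows deterministic.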



## C. Hive (40 points)

In this part we are going to use the Yelp data available in the following JSON file *Yelp/yelp_academic_dataset_user.json*. You may complete this task by using either Hive installed on your laptop or using Hive on Google Cloud Platform. You need to complete the following steps:

### 1. Load data into a Hive table

Create a Hive table and load the input data into this table.

Please describe any commands that you ran in a command line interface and provide all the code that you wrote and ran. For example, this may include any commands run in a terminal, Hive script files (\*.sql), and screenshots (if, for example, you used Google Cloud Platform). See the class examples for references.

Note:
* The dataset is in JSON format whereas in the class the datasets were in XML or TXT format. You need to figure out how to load data from a JSON file to a Hive table. 
* You need to infer the schema by looking at the data. 

Hints: 

* Some of the columns are of array type. For example, you should use `array<STRING>` for the friends column.
* The size of the dataset is large (about 1GB). You may want to create a smaller dataset first, work with this smaller dataset until you have developed and tested your code, and then apply it to the original dataset.
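
The two hints above can be illustrated with a short sketch: one helper copies the first lines of a large file into a smaller one, and another lists the top-level fields of a single record together with their types. This assumes the file holds one JSON object per line; the helper names `write_sample` and `field_types` are hypothetical, and the example record uses made-up values for the `name`, `review_count` and `friends` fields.

```python
import json
from itertools import islice

def write_sample(src_path, dst_path, n_lines=1000):
    """Copy the first n_lines lines of src_path to dst_path; return lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n_lines without reading the rest of the file
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written

def field_types(json_line):
    """Map each top-level key of one JSON record to its Python type name."""
    record = json.loads(json_line)
    return {key: type(value).__name__ for key, value in record.items()}

# Made-up record, for illustration only
example = '{"name": "x", "review_count": 3, "friends": ["u1", "u2"]}'
print(field_types(example))
```

A field reported as `list` here would be a candidate for an array column in the Hive table.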


### 2. Simple queries

Having created the Hive table and loaded the data into it, write and execute queries to:

i. retrieve the schema;

ii. show the number of rows in the table;

iii. select top 10 users who have provided the largest number of reviews (the output should consist of the user name and the number of reviews of the users).

For all the queries, please show both the commands you used and the output. You may copy and paste the commands that you run and the outputs, or provide screenshots.

## Answers:


In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gcloud config set project project-longwind48
Updated property [core/project].

In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gcloud config list
[compute]
region = europe-west1
zone = europe-west1-d
[core]
account = longwind48@gmail.com
disable_usage_reporting = True
project = project-longwind48

Your active configuration is: [default]

In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gsutil mb gs://tracilim-bucket/
Creating gs://tracilim-bucket/...

Upload the dataset to my bucket.

In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gsutil cp C:\hduser\yelp_academic_dataset_user.json gs://tracilim-bucket/data/yelp_academic_dataset_user.json
Copying file://C:\hduser\yelp_academic_dataset_user.json [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

\ [1 files][  1.1 GiB/  1.1 GiB]    9.6 MiB/s
Operation completed over 1 objects/1.1 GiB.

Create a cluster named 'mycluster'.

In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gcloud dataproc clusters create mycluster --project project-longwind48 --bucket tracilim-bucket
API [dataproc.googleapis.com] not enabled on project
[project-longwind48]. Would you like to enable and retry?  (y/N)?  y

Enabling service dataproc.googleapis.com on project project-longwind48...
Waiting for async operation operations/tmo-acf.2cb981fd-cbc2-419e-aae5-70026f4c7f3c to complete...
Operation finished successfully. The following command can describe the Operation details:
 gcloud services operations describe operations/tmo-acf.2cb981fd-cbc2-419e-aae5-70026f4c7f3c
Waiting on operation [projects/project-longwind48/regions/global/operations/9b1e37cc-5500-47d6-b815-9d431ef9cba0].
Waiting for cluster creation operation...done.
Created [https://dataproc.googleapis.com/v1/projects/project-longwind48/regions/global/clusters/mycluster] Cluster placed in zone [europe-west1-d].

Create a table.

In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gcloud dataproc jobs submit hive --cluster mycluster -e "create table json_yelp1 (json string);"
Job [de2fffc3-71d3-42e2-a3c1-368b37c1d043] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://mycluster-m:10000
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
18/02/14 17:40:49 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
No rows affected (0.144 seconds)
Beeline version 2.1.1 by Apache Hive
Closing: 0: jdbc:hive2://mycluster-m:10000
Job [de2fffc3-71d3-42e2-a3c1-368b37c1d043] finished successfully.
driverControlFilesUri: gs://tracilim-bucket/google-cloud-dataproc-metainfo/d8d61f95-861b-42e2-9f8f-b8222e02f5a0/jobs/de2fffc3-71d3-42e2-a3c1-368b37c1d043/
driverOutputResourceUri: gs://tracilim-bucket/google-cloud-dataproc-metainfo/d8d61f95-861b-42e2-9f8f-b8222e02f5a0/jobs/de2fffc3-71d3-42e2-a3c1-368b37c1d043/driveroutput
hiveJob:
  queryList:
    queries:
    - create table json_yelp1 (json string);
placement:
  clusterName: mycluster
  clusterUuid: d8d61f95-861b-42e2-9f8f-b8222e02f5a0
reference:
  jobId: de2fffc3-71d3-42e2-a3c1-368b37c1d043
  projectId: project-longwind48
status:
  state: DONE
  stateStartTime: '2018-02-14T17:40:51.047Z'
statusHistory:
- state: PENDING
  stateStartTime: '2018-02-14T17:40:43.220Z'
- state: SETUP_DONE
  stateStartTime: '2018-02-14T17:40:44.095Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2018-02-14T17:40:45.164Z'
  

Load data into the table.

In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gcloud dataproc jobs submit hive --cluster mycluster -e "load data inpath 'gs://tracilim-bucket/data/yelp_academic_dataset_user.json' INTO TABLE json_yelp1;"
Job [89038e2d-14b0-476d-94a3-51344f891aea] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://mycluster-m:10000
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
18/02/14 17:42:37 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
Error: Error while compiling statement: FAILED: SemanticException [Error 10028]: Line 1:23 Path is not legal ''gs://tracilim-bucket/data/yelp_academic_dataset_user.json'': Source file system should be "file" if "local" is specified (state=42000,code=10028)
Closing: 0: jdbc:hive2://mycluster-m:10000
ERROR: (gcloud.dataproc.jobs.submit.hive) Job [89038e2d-14b0-476d-94a3-51344f891aea] entered state [ERROR] while waiting for [DONE].

Retrieve the schema.

In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gcloud dataproc jobs submit hive --cluster mycluster -e "describe json_yelp1;"
Job [7677d21c-0f6a-469e-89e0-d3efdd5a6e59] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://mycluster-m:10000
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
18/02/14 17:43:54 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| json      | string     |          |
+-----------+------------+----------+--+
1 row selected (0.132 seconds)
Beeline version 2.1.1 by Apache Hive
Closing: 0: jdbc:hive2://mycluster-m:10000
Job [7677d21c-0f6a-469e-89e0-d3efdd5a6e59] finished successfully.
driverControlFilesUri: gs://tracilim-bucket/google-cloud-dataproc-metainfo/d8d61f95-861b-42e2-9f8f-b8222e02f5a0/jobs/7677d21c-0f6a-469e-89e0-d3efdd5a6e59/
driverOutputResourceUri: gs://tracilim-bucket/google-cloud-dataproc-metainfo/d8d61f95-861b-42e2-9f8f-b8222e02f5a0/jobs/7677d21c-0f6a-469e-89e0-d3efdd5a6e59/driveroutput
hiveJob:
  queryList:
    queries:
    - describe json_yelp1;
placement:
  clusterName: mycluster
  clusterUuid: d8d61f95-861b-42e2-9f8f-b8222e02f5a0
reference:
  jobId: 7677d21c-0f6a-469e-89e0-d3efdd5a6e59
  projectId: project-longwind48
status:
  state: DONE
  stateStartTime: '2018-02-14T17:43:55.924Z'
statusHistory:
- state: PENDING
  stateStartTime: '2018-02-14T17:43:48.710Z'
- state: SETUP_DONE
  stateStartTime: '2018-02-14T17:43:49.231Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2018-02-14T17:43:49.948Z'

Show the number of rows in the table: 1029432 rows

In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gcloud dataproc jobs submit hive --cluster mycluster -e "select count(*) from json_yelp1;"
Job [b07ef972-af76-49a1-aca4-012eb0b38909] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://mycluster-m:10000
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
18/02/14 17:46:33 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+----------+--+
|    c0    |
+----------+--+
| 1029432  |
+----------+--+
1 row selected (31.176 seconds)
Beeline version 2.1.1 by Apache Hive
Closing: 0: jdbc:hive2://mycluster-m:10000
Job [b07ef972-af76-49a1-aca4-012eb0b38909] finished successfully.
driverControlFilesUri: gs://tracilim-bucket/google-cloud-dataproc-metainfo/d8d61f95-861b-42e2-9f8f-b8222e02f5a0/jobs/b07ef972-af76-49a1-aca4-012eb0b38909/
driverOutputResourceUri: gs://tracilim-bucket/google-cloud-dataproc-metainfo/d8d61f95-861b-42e2-9f8f-b8222e02f5a0/jobs/b07ef972-af76-49a1-aca4-012eb0b38909/driveroutput
hiveJob:
  queryList:
    queries:
    - select count(*) from json_yelp1;
placement:
  clusterName: mycluster
  clusterUuid: d8d61f95-861b-42e2-9f8f-b8222e02f5a0
reference:
  jobId: b07ef972-af76-49a1-aca4-012eb0b38909
  projectId: project-longwind48
status:
  state: DONE
  stateStartTime: '2018-02-14T17:47:08.717Z'
statusHistory:
- state: PENDING
  stateStartTime: '2018-02-14T17:46:27.135Z'
- state: SETUP_DONE
  stateStartTime: '2018-02-14T17:46:27.652Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2018-02-14T17:46:28.746Z'
yarnApplications:
- name: select count(*) from json_yelp1(Stage-1)
  progress: 1.0
  state: FINISHED
  trackingUrl: http://mycluster-m:8088/proxy/application_1518622563983_0001/

Select the top 10 users who have provided the largest number of reviews.

In [None]:
C:\Users\longwind48\AppData\Local\Google\Cloud SDK>gcloud dataproc jobs submit hive --cluster mycluster -e "select get_json_object(json_yelp1.json, '$.name') as name, get_json_object(json_yelp1.json, '$.review_count') as review_count from json_yelp1 ORDER BY review_count DESC LIMIT 10;"
Job [239dd811-e573-4014-ba16-ab308b6ccc48] submitted.
Waiting for job output...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://mycluster-m:10000
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
18/02/14 17:59:38 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
+-----------+---------------+--+
|   name    | review_count  |
+-----------+---------------+--+
| Michelle  | 999           |
| claudia   | 997           |
| Julie     | 997           |
| Kathy     | 996           |
| Dan       | 996           |
| Nobbi     | 996           |
| Allison   | 996           |
| Meghan    | 996           |
| Tiffany   | 994           |
| Tiffany   | 994           |
+-----------+---------------+--+
10 rows selected (32.771 seconds)
Beeline version 2.1.1 by Apache Hive
Closing: 0: jdbc:hive2://mycluster-m:10000
Job [239dd811-e573-4014-ba16-ab308b6ccc48] finished successfully.
driverControlFilesUri: gs://tracilim-bucket/google-cloud-dataproc-metainfo/d8d61f95-861b-42e2-9f8f-b8222e02f5a0/jobs/239dd811-e573-4014-ba16-ab308b6ccc48/
driverOutputResourceUri: gs://tracilim-bucket/google-cloud-dataproc-metainfo/d8d61f95-861b-42e2-9f8f-b8222e02f5a0/jobs/239dd811-e573-4014-ba16-ab308b6ccc48/driveroutput
hiveJob:
  queryList:
    queries:
    - select get_json_object(json_yelp1.json, '$.name') as name, get_json_object(json_yelp1.json,
      '$.review_count') as review_count from json_yelp1 ORDER BY review_count DESC
      LIMIT 10;
placement:
  clusterName: mycluster
  clusterUuid: d8d61f95-861b-42e2-9f8f-b8222e02f5a0
reference:
  jobId: 239dd811-e573-4014-ba16-ab308b6ccc48
  projectId: project-longwind48
status:
  state: DONE
  stateStartTime: '2018-02-14T18:00:14.090Z'
statusHistory:
- state: PENDING
  stateStartTime: '2018-02-14T17:59:32.313Z'
- state: SETUP_DONE
  stateStartTime: '2018-02-14T17:59:33.144Z'
- details: Agent reported job success
  state: RUNNING
  stateStartTime: '2018-02-14T17:59:33.881Z'
yarnApplications:
- name: select get_json_object(json_yelp1.json,...10(Stage-1)
  progress: 1.0
  state: FINISHED
  trackingUrl: http://mycluster-m:8088/proxy/application_1518622563983_0003/
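
One caveat about the last query: Hive's `get_json_object` returns a string, so `ORDER BY review_count DESC` compares the values as text rather than as numbers. That is consistent with the output above, where no count exceeds 999. Casting the extracted value to an integer before ordering would presumably give a numeric ranking, though that variant was not run here. The difference between the two orderings can be seen in a small Python sketch with made-up values:

```python
# Made-up review counts, stored as text as get_json_object would return them
counts_as_text = ["999", "1200", "85", "10022", "997"]

# Text comparison: "999" ranks above "10022" because '9' > '1'
text_order = sorted(counts_as_text, reverse=True)

# Numeric comparison: convert to int before comparing
numeric_order = sorted(counts_as_text, key=int, reverse=True)

print(text_order)     # ['999', '997', '85', '1200', '10022']
print(numeric_order)  # ['10022', '1200', '999', '997', '85']
```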