
# Databricks Coding Challenge - SQL
### Note: All questions should be answered using SQL

## Spark SQL and DataFrames 

In this section, you'll read in data to create a DataFrame in Spark.  We'll be reading in a dataset stored in the Databricks File System (DBFS).  Please see this [link](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html#databricks-file-system-dbfs) for more details on how to use DBFS.

## Understanding the data set

### Overview:
The data set used throughout the coding assessment resembles telemetry data that any software-as-a-service (SaaS) company might collect. One record represents the node hours for a single workload running on a transient cluster, aggregated at the date and workload type level. This data set may be used to help Databricks understand consumption patterns and user behaviors on our platform. For example, we can inspect this data to understand whether a given customer prefers our `automated` or `interactive` features, or which AWS instance types are preferred among all of our customers.

### Format:
 * JSON
 * Resides on S3

### Schema:
* date (String)
* nodeHours (Double)
* workloadType (String) (read more [here](https://databricks.com/product/aws-pricing#clusters))
* metadata (Struct)
  * clusterMetadata (Struct): Describes the cluster configuration
  * runtimeMetadata (Struct): Describes the software configuration
  * workloadMetadata (Struct): Describes the customer. Each shard may have one or many workspaces, and each workspace may have zero or many clusters
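
To make the nesting concrete, the sketch below builds one hypothetical record with the shape described above and resolves a dotted path such as `metadata.workloadMetadata.clusterId`, which is the notation the SQL queries in this challenge use. Every value is invented for illustration, the leaf fields are only those named in the questions, and their types are assumptions. `get_path` is a hypothetical helper, not part of any Spark API.

```python
import json
from functools import reduce

# One hypothetical record; every value below is made up for illustration.
SAMPLE_RECORD = json.loads("""
{
  "date": "2018-01-01",
  "nodeHours": 1.5,
  "workloadType": "interactive",
  "metadata": {
    "clusterMetadata": {"containerIsSpot": true},
    "runtimeMetadata": {"sparkVersion": "example-version"},
    "workloadMetadata": {"clusterId": "example-cluster", "workspaceId": "42"}
  }
}
""")


def get_path(record, dotted_path):
    """Walk a nested dict one key at a time, following a dotted path."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), record)


cluster_id = get_path(SAMPLE_RECORD, "metadata.workloadMetadata.clusterId")
print(cluster_id)
```

In Spark SQL the same dotted path can be written directly in a `select`, because struct fields are addressed with dot notation.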



### Part A: Spark SQL and DataFrames


Execute the commands below to list the files in the directory you will be analyzing and to preview one of them.  There are several files in this test dataset.

In [0]:
%fs ls /databricks-coding-challenge/workloads

In [0]:
%fs head dbfs:/databricks-coding-challenge/workloads/part-00000-tid-7467717951814126607-30bac750-dd23-4160-a2a6-e57064ff0dc6-1506091-1-c000.json


### Question 1 (15 points):
Please create a temporary Spark SQL view called "workloads" from the JSON files in the directory listed above.

In [0]:
%python
# Read every JSON file in the directory into a single DataFrame
workloads_df = spark.read.json("dbfs:/databricks-coding-challenge/workloads")
# Register the DataFrame as a temporary view so it can be queried with SQL
workloads_df.createOrReplaceTempView('workloads')
# Display the contents of the view
spark.sql("select * from workloads").display()


What is the schema for this table?

In [0]:
%python
workloads_df.printSchema() #printing the schema


### Question 2 (15 points):

Please print out all the unique workspaceIds in this dataset, ordered so that the workspaceIds increase numerically.
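
One point worth illustrating here is the ordering: if the IDs are stored as strings, a plain `order by` sorts them lexically, so casting to a 64-bit integer is one way to get numeric order. The sketch below shows the idea using Python's built-in `sqlite3` as a stand-in for Spark SQL, with a flattened `workspaceId` column and invented rows (apart from the one ID quoted in Question 4). Whether the real field is a string or a number is an assumption; the cast is harmless in either case.

```python
import sqlite3

# In-memory stand-in for the "workloads" view; rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workloads (workspaceId TEXT, nodeHours REAL)")
conn.executemany(
    "INSERT INTO workloads VALUES (?, ?)",
    [("10", 1.0), ("-9014487477555684744", 2.5), ("9", 0.5), ("10", 3.0)],
)

# Cast before de-duplicating and sorting so that 9 sorts before 10.
query = """
    SELECT DISTINCT CAST(workspaceId AS BIGINT) AS workspace_id
    FROM workloads
    ORDER BY workspace_id
"""
workspace_ids = [row[0] for row in conn.execute(query)]
print(workspace_ids)  # [-9014487477555684744, 9, 10]
```

Against the real view, the equivalent Spark SQL would presumably read `select distinct cast(metadata.workloadMetadata.workspaceId as bigint) as workspace_id from workloads order by workspace_id;`.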


### Question 3 (15 points):

What is the number of unique clusters in this data set?  A cluster is identified by the `metadata.workloadMetadata.clusterId` field.

In [0]:
-- Count all the distinct cluster IDs
select count(distinct metadata.workloadMetadata.clusterId) as no_of_unique_clusters
from workloads;

### Question 4 (15 points): 
What is the number of workload hours each day for the workspaceId `-9014487477555684744`?

In [0]:
-- Sum node hours per day for the given workspace
select date as day, round(sum(nodeHours), 2) as workload_hours
from workloads
where metadata.workloadMetadata.workspaceId = '-9014487477555684744'
group by date
order by 1;


Determine how many nodes are spot vs. on demand for a given cluster.

In [0]:
select metadata.workloadMetadata.clusterId,
  sum(case when metadata.clusterMetadata.containerIsSpot = 'true' then 1 else 0 end) as spot,
  sum(case when metadata.clusterMetadata.containerIsSpot = 'false' then 1 else 0 end) as on_demand
from workloads group by 1;


### Question 5 (15 points): 

How many interactive node hours per day are there on each Spark version over time?

In [0]:
select date as day, metadata.runtimeMetadata.sparkVersion, round(sum(nodeHours), 2) as interactive_node_hours
from workloads
where workloadType = 'interactive'
group by 1, 2 order by 1;

### Question 6 (25 points):
#### TPC-H Dataset
You're provided with Line Item records from the TPC-H data set. The data is located in `/databricks-datasets/tpch/data-001/lineitem`.
Find the top two most recently shipped (shipDate) Line Items per Part using the simplest and most efficient approach.

You're free to use any combinations of SparkSQL, PySpark or Scala Spark to answer this challenge.

![](https://docs.deistercloud.com/mediaContent/Databases.30/TPCH%20Benchmark.90/media/tpch_schema.png?v=0)

In [0]:
%python
src = '/databricks-datasets/tpch/data-001/lineitem/lineitem.tbl'
# Explicit schema: the file is pipe-delimited and has no header row
schema = ", ".join(['orderkey int', 'partkey int', 'suppkey int', 'lineNumber int', 'quantity int', 'extendedprice float', 'discount float', 'tax float', 'returnflag string', 'linestatus string', 'shipdate date', 'commitdate date', 'receiptdate date', 'shipinstruct string', 'shipmode string', 'comment string'])
tpc_h = (spark.read.format("csv")
          .schema(schema)
          .option("header", False)
          .option("sep", "|")
          #.option("inferSchema", True)
          .load(src)
        )


In [0]:
%python
dbutils.fs.head('/databricks-datasets/tpch/data-001/lineitem/lineitem.tbl')

In [0]:
%python
display(tpc_h)

In [0]:
%python
tpc_h.createOrReplaceTempView('tpc_h')

Find the top two most recently shipped (shipDate) Line Items per Part using the simplest and most efficient approach.
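
One common way to approach a "top N per group" question like this is a window function: number the rows within each part by ship date, newest first, and keep the first two. The sketch below demonstrates that pattern with Python's built-in `sqlite3` standing in for Spark SQL (window functions require SQLite 3.25 or later). The table is a cut-down, hypothetical version of the `tpc_h` view with invented rows, and this is one possible approach rather than a reference solution.

```python
import sqlite3

# In-memory stand-in for the "tpc_h" view with only the columns needed here.
# Ship dates are ISO-formatted text, so string order matches date order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tpc_h (orderkey INT, partkey INT, lineNumber INT, shipdate TEXT)"
)
conn.executemany(
    "INSERT INTO tpc_h VALUES (?, ?, ?, ?)",
    [
        (1, 100, 1, "1996-03-13"),
        (2, 100, 1, "1997-01-28"),
        (3, 100, 2, "1998-04-12"),
        (4, 200, 1, "1995-07-01"),
        (5, 200, 1, "1994-02-02"),
        (6, 300, 1, "1993-11-09"),
    ],
)

# Number rows within each part from newest to oldest, then keep the first two.
query = """
    SELECT partkey, orderkey, lineNumber, shipdate
    FROM (
        SELECT partkey, orderkey, lineNumber, shipdate,
               ROW_NUMBER() OVER (PARTITION BY partkey ORDER BY shipdate DESC) AS rn
        FROM tpc_h
    ) ranked
    WHERE rn <= 2
    ORDER BY partkey, shipdate DESC
"""
top_two = conn.execute(query).fetchall()
for row in top_two:
    print(row)
```

The inner query uses only standard window syntax, so it should carry over to Spark SQL against the `tpc_h` view. `ROW_NUMBER` returns at most two rows per part even when ship dates tie, picking arbitrarily among the tied rows; `RANK` or `DENSE_RANK` would keep all tied rows instead.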