<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://cdn2.hubspot.net/hubfs/438089/docs/training/dblearning-banner.png" alt="Databricks Learning" width="555" height="64">
</div>

&copy; 2018 Databricks, Inc. All rights reserved.<br/>

### Why Contoso Needs a Data Lake
* Store both big and small data in one location for all personas: Data Engineering, Data Science, Analysts
* Access this data in different languages and tools: SQL, Python, Scala, Java, R with Notebooks, IDE, Power BI, Tableau, JDBC/ODBC

### Azure Databricks Solutions
* Azure Storage or Azure Data Lake - a place to store all data, big and small
* Access both big (TB to PB) and small data easily with Databricks' scalable clusters
* Use Python, Scala, R, SQL, Java

#### Azure Databricks for Batch ETL & Data Engineers

![arch](https://kpistoropen.blob.core.windows.net/collateral/roadshow/azure_roadshow_de.png)

# Reading & Writing Data to/from Files - Parquet and CSV

**Technical Accomplishments:**
- Read data from CSV 
- Write out data in optimal Parquet format with Schema

## ![Spark Logo Tiny](https://kpistoropen.blob.core.windows.net/collateral/roadshow/logo_spark_tiny.png) Reading from CSV

### The Data Source
* For this exercise, we will be using a file called **product.csv**.
* The data represents new products we are planning to add to our online store.
* We can use **&percnt;fs head ...** to view the first few lines of the file.

In [8]:
%fs ls /mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/

path,name,size
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/product.csv,product.csv,3449


In [9]:
%fs head /mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/product.csv

### Option #1 - Read the CSV File, Creating a DataFrame
Let's start with the bare minimum: specifying that the file we want to read is delimited, and where the file is located.
The default delimiter for `spark.read.csv(..)` is a comma, but we can change it by setting the `delimiter` option.

In [11]:
csvFile = "/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/product.csv"

In [12]:
df = (spark.read                        # The DataFrameReader
   .option("header", "true")       # Use first line of all files as header
   .option("inferSchema", "true")  # Automatically infer data types
   .csv(csvFile)                   # Creates a DataFrame from CSV after reading in the file
)
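
The `delimiter` option mentioned above can be sketched as follows. This is a minimal sketch rather than part of the original exercise: the helper name `read_delimited` and the pipe-delimited file path are hypothetical, and `spark` is assumed to be the `SparkSession` that Databricks notebooks provide.

```python
def read_delimited(spark, path, delimiter=","):
    """Read a delimited text file into a DataFrame (hypothetical helper)."""
    return (spark.read
        .option("header", "true")         # First line holds the column names
        .option("inferSchema", "true")    # Let Spark infer the data types
        .option("delimiter", delimiter)   # Field separator; a comma by default
        .csv(path))                       # Creates the DataFrame

# Example with a hypothetical pipe-delimited copy of the product file:
# pipeDF = read_delimited(spark, "/tmp/product_pipe.csv", delimiter="|")
```

Passing the separator as a parameter keeps the remaining reader options identical to the cell above, so only the delimiter differs between calls.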

In [13]:
display(df)

product_id,category,brand,model,price,processor,size,display
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.989990234375,,,
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.989990234375,,,
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.989990234375,,,
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,
6,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.989990234375,,,
7,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.989990234375,,,
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.989990234375,,,
9,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.989990234375,,,
10,Laptops,HP,"""Pavilion x360 2-in-1 14"""" Touch-Screen Laptop""",999.989990234375,,,


In [14]:
# Randomly split the rows into two DataFrames holding roughly 80% and 20% of the data
(df1, df2) = df.randomSplit([0.8, 0.2])
display(df2)

product_id,category,brand,model,price,processor,size,display
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.989990234375,,,
11,Laptops,Dell,"""2-in-1 17.3"""" Touch-Screen Laptop""",1299.989990234375,,,
17,Laptops,Microsoft,"""Surface Book 13.5"""" Touch Screen with Performance Base""",2399.989990234375,,,
19,Laptops,Microsoft,Surface Pro – 12.3”,2699.989990234375,,,
27,tablets,HP&reg;,Spectre x2 (2017),299.99,Core-i,"12.3""""""""",3000×2000 (3:2)
30,tablets,Samsung,"""""""Galaxy Book 12"""""""" """"""",299.99,Core-i,"12.0""""""""",2160×1440 (3:2)
31,tablets,Samsung,"""""""Galaxy Book 10.6"""""""" """"""",299.99,Core-m,"10.6""""""""",1920×1280 (3:2)
36,tablets,Lenovo&reg;,Miix 720,299.99,Core-i,"12.0""""""""",2880×1920 (3:2)


### Option #2 Create a SQL view, inferring the schema

In [16]:
%sql

DROP TABLE IF EXISTS contoso_products; 
CREATE OR REPLACE TEMPORARY VIEW contoso_products
USING CSV
OPTIONS (path "/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/", header "true", inferSchema "true")

In [17]:
%sql

SELECT * FROM contoso_products

product_id,category,brand,model,price,processor,size,display
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.989990234375,,,
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.989990234375,,,
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.989990234375,,,
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,
6,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.989990234375,,,
7,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.989990234375,,,
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.989990234375,,,
9,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.989990234375,,,
10,Laptops,HP,"""Pavilion x360 2-in-1 14"""" Touch-Screen Laptop""",999.989990234375,,,


Create a persistent table

In [19]:
%sql
DROP TABLE IF EXISTS contoso_products_persistent;
CREATE TABLE contoso_products_persistent
USING CSV
OPTIONS (path "/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/", header "true", inferSchema "true")

One might expect that defining a `DataFrame` immediately computes its values; however, that's not quite how Apache Spark works. Spark offers two distinct kinds of operations: **transformations** and **actions**.

### Transformations

Transformations are operations that are not carried out at the time you write and execute the code in a cell; they are executed only once you call an **action**. An example of a transformation might be converting an integer into a float or filtering a set of values.

### Actions

Actions are commands that Spark computes at the time they are executed. They run all of the preceding transformations in order to return an actual result. An action is composed of one or more jobs, each of which consists of tasks that the workers execute in parallel where possible.

Here are some simple examples of transformations and actions. Remember, these **are not all** the transformations and actions - this is just a short sample of them. We'll get to why Apache Spark is designed this way shortly!

![transformations and actions](https://training.databricks.com/databricks_guide/gentle_introduction/trans_and_actions.png)
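
The distinction can be illustrated with a plain-Python analogy. This is not Spark code: it only assumes that Python's built-in `map` and `filter` are lazy, which mirrors how transformations are recorded first and executed later. The names `parse_price`, `evaluated`, and the sample prices are made up for illustration.

```python
evaluated = []  # Records which raw values have actually been processed

def parse_price(raw):
    """Convert a raw string to a float and log that the work happened."""
    evaluated.append(raw)
    return float(raw)

raw_prices = ["1499.99", "1299.99", "299.99"]

# "Transformations": these lines only describe the pipeline; nothing is parsed yet
prices = map(parse_price, raw_prices)
expensive = filter(lambda price: price > 1000.0, prices)
nothing_ran_yet = len(evaluated) == 0

# "Action": consuming the pipeline forces every earlier step to run
result = list(expensive)

print(nothing_ran_yet)  # True
print(result)           # [1499.99, 1299.99]
```

In PySpark terms, a call such as `df.filter(..)` plays the role of building the pipeline, and a call such as `df.count()` plays the role of consuming it.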

Inferring the schema can be costly when reading a large file, because Spark is forced to read through all the data in the files in order to determine the data types. To read in a file and avoid this costly extra job, we can provide the schema to the `DataFrameReader`.
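
To see why inference needs a full pass over the data, consider this simplified, plain-Python stand-in. It is an assumption-laden sketch, not Spark's actual inference logic: the helper `infer_column_type`, its three type names, and the sample rows are all hypothetical.

```python
import csv
import io

def infer_column_type(values):
    """Return the narrowest of int/double/string that fits every non-empty value."""
    seen_value = False
    inferred = "int"
    for value in values:
        if value == "":
            continue                  # Empty fields are treated as nulls
        seen_value = True
        try:
            int(value)
            continue                  # Still compatible with the current guess
        except ValueError:
            pass
        try:
            float(value)
            inferred = "double"       # Widen from int to double
        except ValueError:
            return "string"           # Anything else falls back to string
    return inferred if seen_value else "string"

# The first price looks like an integer; only a later row reveals it is a double
sample = io.StringIO(
    "product_id,price,brand\n"
    "1,1499,HP\n"
    "2,1299.99,Microsoft\n"
)
rows = list(csv.DictReader(sample))
schema = {name: infer_column_type(row[name] for row in rows) for name in rows[0]}
print(schema)  # {'product_id': 'int', 'price': 'double', 'brand': 'string'}
```

Because a single late row can change a column's type, the whole file has to be scanned before the schema is known, which is the extra work a user-supplied schema avoids.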

### Option #3 Create a DataFrame with a given schema

This time we are going to read the same file.

The difference here is that we are going to define the schema beforehand to avoid the execution of any extra jobs.

Declare the schema.

This is just a list of field names and data types.

In [24]:
# Import only the types that the schema below needs
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

# Each StructField is (column name, data type, nullable)
csvSchema = StructType([
  StructField("product_id", LongType(), True),
  StructField("category", StringType(), True),
  StructField("brand", StringType(), True),
  StructField("model", StringType(), True),
  StructField("price", DoubleType(), True),
  StructField("processor", StringType(), True),
  StructField("size", StringType(), True),
  StructField("display", StringType(), True)
])

Read in our data (and print the schema).

We can specify the schema, or rather the `StructType`, with the `schema(..)` command:

In [26]:
productDF = (spark.read                   # The DataFrameReader
  .option('header', 'true')   # Ignore line #1 - it's a header
  .schema(csvSchema)          # Use the specified schema
  .csv(csvFile)               # Creates a DataFrame from CSV after reading in the file
)
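
As an alternative to building a `StructType` by hand, `schema(..)` also accepts a DDL-formatted string, assuming Spark 2.3 or later. The sketch below builds such a string for the same eight columns; the names `columns`, `ddl_schema`, and `read_with_ddl` are hypothetical, and `spark` is assumed to be the notebook's `SparkSession`.

```python
# Column names paired with Spark SQL type names, mirroring csvSchema
columns = [
    ("product_id", "BIGINT"),
    ("category", "STRING"),
    ("brand", "STRING"),
    ("model", "STRING"),
    ("price", "DOUBLE"),
    ("processor", "STRING"),
    ("size", "STRING"),
    ("display", "STRING"),
]

# Produces "product_id BIGINT, category STRING, ..."
ddl_schema = ", ".join(f"{name} {dtype}" for name, dtype in columns)

def read_with_ddl(spark, path, ddl):
    """Read a CSV file with a header row, using a DDL string as the schema."""
    return (spark.read
        .option("header", "true")   # Skip the header line
        .schema(ddl)                # DDL string instead of a StructType
        .csv(path))

# Example: ddlDF = read_with_ddl(spark, csvFile, ddl_schema)
```

The DDL form is more compact, while the `StructType` form makes nullability explicit for each field; either way no inference job is needed.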

With our DataFrame created, we can now create a temporary view and then view the data via SQL:

In [28]:
display(productDF)

product_id,category,brand,model,price,processor,size,display
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.989990234375,,,
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.989990234375,,,
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.989990234375,,,
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,
6,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.989990234375,,,
7,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.989990234375,,,
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.989990234375,,,
9,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.989990234375,,,
10,Laptops,HP,"""Pavilion x360 2-in-1 14"""" Touch-Screen Laptop""",999.989990234375,,,


In [29]:
# create a view called contoso_products_df
productDF.createOrReplaceTempView("contoso_products_df")

In [30]:
%sql
SELECT * FROM contoso_products_df LIMIT 10

product_id,category,brand,model,price,processor,size,display
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.989990234375,,,
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.989990234375,,,
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.989990234375,,,
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,
6,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.989990234375,,,
7,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.989990234375,,,
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.989990234375,,,
9,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.989990234375,,,
10,Laptops,HP,"""Pavilion x360 2-in-1 14"""" Touch-Screen Laptop""",999.989990234375,,,


In [31]:
productDF.count()

In [32]:
%sql
select count(*) from contoso_products_df

count(1)
40


And now we can take a peek at the data with a simple SQL SELECT statement:

<h2 style="color:green">Tip</h2>

* Switch Languages (SQL, Scala, R, Shell, File System)
* JDBC/ODBC!

<h2 style="color:green">Collaboration example</h2>

* Live comments
* Revision history

In [36]:
%sql
SELECT * FROM contoso_products

product_id,category,brand,model,price,processor,size,display
1,Laptops,HP,"""Spectre x360 2-in-1 13.3"""" 4K Ultra HD Touch-Screen Laptop""",1499.989990234375,,,
2,Laptops,Microsoft,"Surface Pro – 12.3""""",1299.989990234375,,,
3,Laptops,Microsoft,"Surface Book 2 - 13.5""""",1499.989990234375,,,
4,Laptops,Dell,"XPS 2-in-1 13.3""""",1949.989990234375,,,
5,Laptops,Lenovo,"Yoga 920 2-in-1 13.9""""",1799.989990234375,,,
6,Laptops,Apple,"""MacBook Pro - 15"""" Display""",2659.989990234375,,,
7,Laptops,Apple,"""MacBook Pro - 13"""" Display""",1499.989990234375,,,
8,Laptops,Apple,"""MacBook Pro - 15.4"""" Display""",1999.989990234375,,,
9,Laptops,Apple,"""MacBook Air - 13.3"""" Display""",999.989990234375,,,
10,Laptops,HP,"""Pavilion x360 2-in-1 14"""" Touch-Screen Laptop""",999.989990234375,,,


### Option #4 Create a SQL table w/ user-defined schema

In [38]:
%sql 
DROP TABLE IF EXISTS contoso_products_manual;
CREATE TABLE contoso_products_manual (
  product_id int, 
  category string, 
  brand string, 
  model string, 
  price double,
  processor string,
  size string,
  display string)
USING CSV
OPTIONS (path "/mnt/databricks-workshop-datasets/Contoso-retail/initech/productsCsv/", header "true")

In [39]:
# Load the SQL table defined above as a DataFrame
df_from_sql = spark.table("contoso_products_manual")

## ![Spark Logo Tiny](https://kpistoropen.blob.core.windows.net/collateral/roadshow/logo_spark_tiny.png) Writing to Parquet

* Parquet is a file format that is supported by many other data processing systems. 

* Parquet is a columnar file format that Databricks highly recommends for customers.

* Parquet files provide optimizations under the hood to speed up queries and are a far more efficient file format than CSV or JSON.

* Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

More discussion on <a href="http://parquet.apache.org/documentation/latest/" target="_blank">Parquet</a>

Documentation on <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe%20reader#pyspark.sql.DataFrameReader" target="_blank">DataFrameReader</a>

In [41]:
#Write to the file store
writeBase = "dbfs:/workshop/"

In [42]:
#Define path where to write to -- by default, in this workshop, we write to the workspace filestore
writeBase = "dbfs:/workshop/"
writePath = writeBase + "contoso_products.parquet"

#If multiple users share the same instance, use this writeBase instead, replacing $USERNAME with your own user name, and use it for every subsequent write/read
#writeBase = "dbfs:/workshop/$USERNAME/"
#writePath = writeBase + "contoso_products.parquet"

#As backup, you can always write to this blob
#writeBase = "dbfs:/mnt/databricks-workshop-exercises/Contoso-retail/initech/"
#writePath = writeBase + "contoso_products.parquet"
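
The per-user path described in the comments above can also be built programmatically. This is a small sketch under stated assumptions: the helper `user_write_path` and the user name `alice` are hypothetical, and the base path is the workshop default from the cell above.

```python
import posixpath

def user_write_path(base, username, file_name="contoso_products.parquet"):
    """Join base, user name, and file name into one DBFS path (hypothetical helper)."""
    # posixpath.join adds a "/" only where one is missing
    return posixpath.join(base, username, file_name)

print(user_write_path("dbfs:/workshop/", "alice"))
# dbfs:/workshop/alice/contoso_products.parquet
```

Building the path in one place makes it easier to keep every later read and write pointed at the same per-user location.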

In [43]:
productDF.write.mode("overwrite").parquet(writePath)   

We can see the Parquet files in the file system:

In [45]:
print(writePath)

In [46]:
%fs ls dbfs:/workshop/contoso_products.parquet

path,name,size
dbfs:/workshop/contoso_products.parquet/_SUCCESS,_SUCCESS,0
dbfs:/workshop/contoso_products.parquet/_committed_2556813255054112978,_committed_2556813255054112978,228
dbfs:/workshop/contoso_products.parquet/_committed_4700619820145654838,_committed_4700619820145654838,121
dbfs:/workshop/contoso_products.parquet/_committed_vacuum3576009207923724906,_committed_vacuum3576009207923724906,96
dbfs:/workshop/contoso_products.parquet/_started_2556813255054112978,_started_2556813255054112978,0
dbfs:/workshop/contoso_products.parquet/part-00000-tid-2556813255054112978-6aeb910f-0112-4a7c-a6f6-a4cc52b9e7f2-18-c000.snappy.parquet,part-00000-tid-2556813255054112978-6aeb910f-0112-4a7c-a6f6-a4cc52b9e7f2-18-c000.snappy.parquet,3759
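
Because Parquet files carry their own schema, reading them back needs neither a `schema(..)` call nor schema inference. The sketch below writes a DataFrame and reads it straight back; the helper name `parquet_round_trip` is hypothetical, and `spark` is assumed to be the notebook's `SparkSession`.

```python
def parquet_round_trip(spark, df, path):
    """Write df as Parquet, then read it back (hypothetical helper)."""
    df.write.mode("overwrite").parquet(path)   # Replace any existing output
    return spark.read.parquet(path)            # Schema comes from the files

# Example, reusing names defined earlier in this notebook:
# roundTripDF = parquet_round_trip(spark, productDF, writePath)
# roundTripDF.printSchema()
```

Calling `printSchema()` on the result is a quick way to confirm that the column names and types survived the round trip.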


## ![Spark Logo Tiny](https://kpistoropen.blob.core.windows.net/collateral/roadshow/logo_spark_tiny.png) Writing to Parquet using SQL

In [48]:
%sql 
DROP TABLE IF EXISTS contoso_products_parquet;
CREATE TABLE contoso_products_parquet
USING parquet
-- this path might need to be changed based on where you have been writing to
OPTIONS (path = "dbfs:/workshop/sqlwrite/contoso_products.parquet")
AS 
SELECT * FROM contoso_products

In [49]:
%fs ls dbfs:/workshop/sqlwrite/contoso_products.parquet

path,name,size
dbfs:/workshop/sqlwrite/contoso_products.parquet/_SUCCESS,_SUCCESS,0
dbfs:/workshop/sqlwrite/contoso_products.parquet/_committed_1352254798129581626,_committed_1352254798129581626,228
dbfs:/workshop/sqlwrite/contoso_products.parquet/_committed_5288512592291700470,_committed_5288512592291700470,121
dbfs:/workshop/sqlwrite/contoso_products.parquet/_committed_vacuum6023087286022087975,_committed_vacuum6023087286022087975,96
dbfs:/workshop/sqlwrite/contoso_products.parquet/_started_1352254798129581626,_started_1352254798129581626,0
dbfs:/workshop/sqlwrite/contoso_products.parquet/part-00000-tid-1352254798129581626-aad307ca-85c1-499b-bc9b-41e505a9c7db-19-c000.snappy.parquet,part-00000-tid-1352254798129581626-aad307ca-85c1-499b-bc9b-41e505a9c7db-19-c000.snappy.parquet,3641


## Next Step

[Transformations-Actions]($../2-ETL/2-02 Transformations-Actions)

Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>