
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Reader & Writer
##### Objectives
1. Read from CSV files
1. Read from JSON files
1. Write DataFrame to files
1. Write DataFrame to tables
1. Write DataFrame to a Delta table

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output" target="_blank">DataFrameReader</a>: `csv`, `json`, `option`, `schema`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output" target="_blank">DataFrameWriter</a>: `mode`, `option`, `parquet`, `format`, `saveAsTable`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.StructType.html#pyspark.sql.types.StructType" target="_blank">StructType</a>: `toDDL`

##### Spark Types
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#data-types" target="_blank">Types</a>: `ArrayType`, `DoubleType`, `IntegerType`, `LongType`, `StringType`, `StructType`, `StructField`

In [0]:
%run ./Includes/Classroom-Setup

## DataFrameReader
Interface used to load a DataFrame from external storage systems

```
spark.read.parquet("path/to/files")
```

DataFrameReader is accessible through the SparkSession attribute `read`. This class includes methods to load DataFrames from different external storage systems.

### Read from CSV files
Read from CSV with the DataFrameReader's `csv` method and the following options:

Tab separator, use first line as header, infer schema

In [0]:
usersCsvPath = "/mnt/training/ecommerce/users/users-500k.csv"

usersDF = (spark
           .read
           .option("sep", "\t")
           .option("header", True)
           .option("inferSchema", True)
           .csv(usersCsvPath)
          )

usersDF.printSchema()

Spark's Python API also allows you to specify the DataFrameReader options as parameters to the `csv` method

In [0]:
usersDF = (spark
           .read
           .csv(usersCsvPath, sep="\t", header=True, inferSchema=True)
          )

usersDF.printSchema()
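
When the same options are reused across several reads, it can help to keep them in one dictionary. The sketch below is a minimal illustration under stated assumptions: `apply_options` and `RecordingReader` are hypothetical names, and `RecordingReader` is only a stand-in so the snippet runs without a cluster. It assumes the reader exposes `option(key, value)` and returns a reader, as `spark.read` does.

```python
# Options from the cells above, kept in one place so they can be reused.
csv_options = {"sep": "\t", "header": True, "inferSchema": True}


def apply_options(reader, options):
    """Call reader.option(key, value) for every entry and return the configured reader."""
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader


class RecordingReader:
    """Hypothetical stand-in for spark.read that only records the options it receives."""

    def __init__(self):
        self.options = {}

    def option(self, key, value):
        self.options[key] = value
        return self  # returning the reader is what makes chaining possible


configured = apply_options(RecordingReader(), csv_options)
print(configured.options)
```

In a notebook the same helper could be called as `apply_options(spark.read, csv_options).csv(usersCsvPath)`. PySpark's `DataFrameReader.options(**csv_options)` achieves the same in a single call.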

Manually define the schema by creating a `StructType` with column names and data types

### Defining the schema explicitly saves execution time and cost, because Spark can skip the extra pass over the data that `inferSchema` requires

In [0]:
from pyspark.sql.types import LongType, StringType, StructType, StructField

userDefinedSchema = StructType([
    StructField("user_id", StringType(), True),
    StructField("user_first_touch_timestamp", LongType(), True),
    StructField("email", StringType(), True)
])

Read from CSV using this user-defined schema instead of inferring the schema

In [0]:
usersDF = (spark
           .read
           .option("sep", "\t")
           .option("header", True)
           .schema(userDefinedSchema)
           .csv(usersCsvPath)
          )
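
To see why inference costs time, note that the reader has to look at the data before it can decide on column types. The toy function below is not Spark's inference algorithm; it is a plain-Python sketch of the idea, and `infer_type` is a hypothetical helper. The sample rows are copied from the products data shown in the lab at the end of this notebook.

```python
def infer_type(values):
    """Return the first of 'long', 'double', 'string' that can represent every value."""
    for type_name, cast in (("long", int), ("double", float)):
        try:
            for value in values:
                cast(value)  # raises ValueError if the text does not parse
        except ValueError:
            continue  # try the next, more permissive type
        return type_name
    return "string"


header = ["item_id", "name", "price"]
rows = [
    ["M_STAN_Q", "Standard Queen Mattress", "1045.0"],
    ["M_STAN_K", "Standard King Mattress", "1195.0"],
    ["P_FOAM_S", "Standard Foam Pillow", "59.0"],
]

# Every value of every column is visited just to decide the types.
columns = list(zip(*rows))
inferred = {name: infer_type(col) for name, col in zip(header, columns)}
print(inferred)
```

With a user-defined schema this scan is unnecessary, since the types are already known before any data is read.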

Alternatively, define the schema using <a href="https://en.wikipedia.org/wiki/Data_definition_language" target="_blank">data definition language (DDL)</a> syntax.

In [0]:
DDLSchema = "user_id string, user_first_touch_timestamp long, email string"

usersDF = (spark
           .read
           .option("sep", "\t")
           .option("header", True)
           .schema(DDLSchema)
           .csv(usersCsvPath)
          )
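
For short schemas, the DDL string can also be assembled from a list of column definitions rather than typed by hand. This is a minimal sketch: `to_ddl` is a hypothetical helper, not part of PySpark, and it assumes column names need no backtick quoting.

```python
def to_ddl(fields):
    """Join (column_name, type_name) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {type_name}" for name, type_name in fields)


# Same columns and types as the DDLSchema string above.
user_fields = [
    ("user_id", "string"),
    ("user_first_touch_timestamp", "long"),
    ("email", "string"),
]

DDLSchema = to_ddl(user_fields)
print(DDLSchema)
```

The resulting string is identical to the hand-written one and can be passed to `.schema(DDLSchema)` in the same way.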

### Read from JSON files

Read from JSON with DataFrameReader's `json` method and the infer schema option

In [0]:
eventsJsonPath = "/mnt/training/ecommerce/events/events-500k.json"

eventsDF = (spark
            .read
            .option("inferSchema", True)
            .json(eventsJsonPath)
           )

eventsDF.printSchema()

Read data faster by creating a `StructType` with the column names and data types, so that Spark does not have to infer them

In [0]:
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, LongType, StringType, StructType, StructField

userDefinedSchema = StructType([
    StructField("device", StringType(), True),
    StructField("ecommerce", StructType([
        StructField("purchase_revenue_in_usd", DoubleType(), True),
        StructField("total_item_quantity", LongType(), True),
        StructField("unique_items", LongType(), True)
    ]), True),
    StructField("event_name", StringType(), True),
    StructField("event_previous_timestamp", LongType(), True),
    StructField("event_timestamp", LongType(), True),
    StructField("geo", StructType([
        StructField("city", StringType(), True),
        StructField("state", StringType(), True)
    ]), True),
    StructField("items", ArrayType(
        StructType([
            StructField("coupon", StringType(), True),
            StructField("item_id", StringType(), True),
            StructField("item_name", StringType(), True),
            StructField("item_revenue_in_usd", DoubleType(), True),
            StructField("price_in_usd", DoubleType(), True),
            StructField("quantity", LongType(), True)
        ])
    ), True),
    StructField("traffic_source", StringType(), True),
    StructField("user_first_touch_timestamp", LongType(), True),
    StructField("user_id", StringType(), True)
])

eventsDF = (spark
            .read
            .schema(userDefinedSchema)
            .json(eventsJsonPath)
           )

You can use the `StructType` Scala method `toDDL` to have a DDL-formatted string created for you.

In a Python notebook, create a Scala cell to create the string to copy and paste.

In [0]:
%scala
spark.read.parquet("/mnt/training/ecommerce/events/events.parquet").schema.toDDL

In [0]:
DDLSchema = "`device` STRING,`ecommerce` STRUCT<`purchase_revenue_in_usd`: DOUBLE, `total_item_quantity`: BIGINT, `unique_items`: BIGINT>,`event_name` STRING,`event_previous_timestamp` BIGINT,`event_timestamp` BIGINT,`geo` STRUCT<`city`: STRING, `state`: STRING>,`items` ARRAY<STRUCT<`coupon`: STRING, `item_id`: STRING, `item_name`: STRING, `item_revenue_in_usd`: DOUBLE, `price_in_usd`: DOUBLE, `quantity`: BIGINT>>,`traffic_source` STRING,`user_first_touch_timestamp` BIGINT,`user_id` STRING"

eventsDF = (spark
            .read
            .schema(DDLSchema)
            .json(eventsJsonPath)
           )

## DataFrameWriter
Interface used to write a DataFrame to external storage systems

```
(df.write                         
  .option("compression", "snappy")
  .mode("overwrite")      
  .parquet(outPath)       
)
```

DataFrameWriter is accessible through the DataFrame attribute `write`. This class includes methods to write DataFrames to different external storage systems.

### Write DataFrames to files

Write `usersDF` to parquet with DataFrameWriter's `parquet` method and the following configurations:

Snappy compression, overwrite mode

In [0]:
workingDir

In [0]:
usersOutputPath = workingDir + "/users.parquet"

(usersDF
 .write
 .option("compression", "snappy")
 .mode("overwrite")
 .parquet(usersOutputPath)
)

In [0]:
display(
    dbutils.fs.ls(usersOutputPath)
)

path,name,size,modificationTime
dbfs:/user/anurag@celebaltech.com/dbacademy/spark_programming/asp_1_4_reader_writer/users.parquet/_SUCCESS,_SUCCESS,0,1658408209000
dbfs:/user/anurag@celebaltech.com/dbacademy/spark_programming/asp_1_4_reader_writer/users.parquet/_committed_4823694437460818663,_committed_4823694437460818663,432,1658408209000
dbfs:/user/anurag@celebaltech.com/dbacademy/spark_programming/asp_1_4_reader_writer/users.parquet/_started_4823694437460818663,_started_4823694437460818663,0,1658408208000
dbfs:/user/anurag@celebaltech.com/dbacademy/spark_programming/asp_1_4_reader_writer/users.parquet/part-00000-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11602-1-c000.snappy.parquet,part-00000-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11602-1-c000.snappy.parquet,2075044,1658408209000
dbfs:/user/anurag@celebaltech.com/dbacademy/spark_programming/asp_1_4_reader_writer/users.parquet/part-00001-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11603-1-c000.snappy.parquet,part-00001-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11603-1-c000.snappy.parquet,2074022,1658408209000
dbfs:/user/anurag@celebaltech.com/dbacademy/spark_programming/asp_1_4_reader_writer/users.parquet/part-00002-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11604-1-c000.snappy.parquet,part-00002-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11604-1-c000.snappy.parquet,2074427,1658408209000
dbfs:/user/anurag@celebaltech.com/dbacademy/spark_programming/asp_1_4_reader_writer/users.parquet/part-00003-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11605-1-c000.snappy.parquet,part-00003-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11605-1-c000.snappy.parquet,672136,1658408208000
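
The listing shows that `users.parquet` is a directory: the `part-*.snappy.parquet` files hold the data, while `_SUCCESS`, `_committed_*` and `_started_*` are bookkeeping files written alongside it. As a small sketch, the snippet below separates the two groups and totals the data size. The names and sizes are copied from the listing above; the variable names are illustrative.

```python
# (name, size in bytes) pairs copied from the listing above.
listing = [
    ("_SUCCESS", 0),
    ("_committed_4823694437460818663", 432),
    ("_started_4823694437460818663", 0),
    ("part-00000-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11602-1-c000.snappy.parquet", 2075044),
    ("part-00001-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11603-1-c000.snappy.parquet", 2074022),
    ("part-00002-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11604-1-c000.snappy.parquet", 2074427),
    ("part-00003-tid-4823694437460818663-74a01f3c-2bde-4a4d-819f-ae8f40d5b759-11605-1-c000.snappy.parquet", 672136),
]

# Keep only the Parquet part files, which carry the actual rows.
data_files = [(name, size) for name, size in listing
              if name.startswith("part-") and name.endswith(".parquet")]
total_bytes = sum(size for _, size in data_files)

print(len(data_files), "data files,", total_bytes, "bytes")
```

In a notebook, the same pairs could be built from `dbutils.fs.ls(usersOutputPath)`, whose entries expose the `name` and `size` fields shown in the listing header.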


As with DataFrameReader, Spark's Python API also allows you to specify the DataFrameWriter options as parameters to the `parquet` method

In [0]:
(usersDF
 .write
 .parquet(usersOutputPath, compression="snappy", mode="overwrite")
)
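
The `mode` setting decides what happens when the target path already holds data. The function below is a plain-Python sketch of Spark's save modes (`append`, `overwrite`, `ignore`, and the default `error` / `errorifexists`), not Spark code; `plan_write` and its return strings are hypothetical.

```python
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def plan_write(mode, target_exists):
    """Describe what a write with the given save mode does to the target."""
    mode = mode.lower()
    if mode not in VALID_MODES:
        raise ValueError(f"unknown save mode: {mode}")
    if not target_exists:
        return "write new data"  # every mode behaves the same on an empty target
    if mode == "append":
        return "add to existing data"
    if mode == "overwrite":
        return "replace existing data"
    if mode == "ignore":
        return "leave existing data untouched"
    return "raise an error"  # error / errorifexists


for mode in ("overwrite", "append", "ignore", "error"):
    print(mode, "->", plan_write(mode, target_exists=True))
```

This is why the cells above use `overwrite`: the notebook can be re-run without failing on output left by a previous run.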

### Write DataFrames to tables

Write `eventsDF` to a table using the DataFrameWriter method `saveAsTable`

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> This creates a global table, unlike the local view created by the DataFrame method `createOrReplaceTempView`

In [0]:
eventsDF.write.mode("overwrite").saveAsTable("events_p")

This table was saved in the database created for you in classroom setup. See database name printed below.

In [0]:
print(databaseName)

## Delta Lake

In almost all cases, the best practice is to use Delta Lake format, especially whenever the data will be referenced from a Databricks workspace. 

<a href="https://delta.io/" target="_blank">Delta Lake</a> is an open source technology designed to work with Spark to bring reliability to data lakes.

![delta](https://files.training.databricks.com/images/aspwd/delta_storage_layer.png)

#### Delta Lake's Key Features
- ACID transactions
- Scalable metadata handling
- Unified streaming and batch processing
- Time travel (data versioning)
- Schema enforcement and evolution
- Audit history
- Parquet format
- Compatible with Apache Spark API

### Write Results to a Delta Table

Write `eventsDF` with the DataFrameWriter's `save` method and the following configurations: Delta format, overwrite mode

In [0]:
eventsOutputPath = workingDir + "/delta/events"

(eventsDF
 .write
 .format("delta")
 .mode("overwrite")
 .save(eventsOutputPath)
)
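
On disk, a Delta table is a directory of Parquet data files plus a `_delta_log` folder holding the transaction log, which is what the lab's check at the end of this notebook looks for with `dbutils.fs.ls`. The sketch below applies the same idea to a local directory using only the standard library. `looks_like_delta_table` is a hypothetical helper, and the temporary directory with its empty placeholder file exists purely so the snippet is runnable.

```python
import tempfile
from pathlib import Path


def looks_like_delta_table(table_dir):
    """Return True if the directory has a _delta_log folder and at least one Parquet file."""
    table_dir = Path(table_dir)
    has_log = (table_dir / "_delta_log").is_dir()
    has_data = any(table_dir.glob("*.parquet"))
    return has_log and has_data


# Build a throwaway directory that mimics the layout, purely for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "events"
    (table / "_delta_log").mkdir(parents=True)
    (table / "part-00000.snappy.parquet").touch()  # hypothetical, empty placeholder file
    is_delta = looks_like_delta_table(table)
    parent_is_delta = looks_like_delta_table(tmp)

print(is_delta, parent_is_delta)
```

The check only inspects the top level of the directory, so a partitioned table, whose data files sit in subfolders, would need a recursive search instead.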

# Ingesting Data Lab

Read in CSV files containing products data.

##### Tasks
1. Read with infer schema
2. Read with user-defined schema
3. Read with schema as DDL formatted string
4. Write using Delta format

### 1. Read with infer schema
- View the first CSV file using DBUtils method `fs.head` with the filepath provided in the variable `singleProductCsvFilePath`
- Create `productsDF` by reading from CSV files located in the filepath provided in the variable `productsCsvPath`
  - Configure options to use first line as header and infer schema

In [0]:
# TODO
singleProductCsvFilePath = "/mnt/training/ecommerce/products/products.csv/part-00000-tid-1663954264736839188-daf30e86-5967-4173-b9ae-d1481d3506db-2367-1-c000.csv"

# View the beginning of the first CSV file, as the task asks
print(dbutils.fs.head(singleProductCsvFilePath))

productsCsvPath = "/mnt/training/ecommerce/products/products.csv"

productsDF = (spark
           .read
           .option("sep", ",")
           .option("header", True)
           .option("inferSchema", True)
           .csv(productsCsvPath)
          )

display(productsDF)

item_id,name,price
M_STAN_Q,Standard Queen Mattress,1045.0
M_STAN_K,Standard King Mattress,1195.0
M_STAN_T,Standard Twin Mattress,595.0
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0
M_PREM_T,Premium Twin Mattress,1095.0
M_PREM_K,Premium King Mattress,1995.0
P_DOWN_S,Standard Down Pillow,119.0
P_FOAM_S,Standard Foam Pillow,59.0


**CHECK YOUR WORK**

In [0]:
assert(productsDF.count() == 12)

### 2. Read with user-defined schema
Define schema by creating a `StructType` with column names and data types

In [0]:
# TODO
from pyspark.sql.types import DoubleType, StringType, StructType, StructField

# name is text and price is a decimal number (e.g. 1045.0)
userDefinedSchema = StructType([
    StructField("item_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("price", DoubleType(), True)
])


productsDF2 = (spark
           .read
           .option("sep", ",")
           .option("header", True)
           .schema(userDefinedSchema)
           .csv(productsCsvPath)
          )

display(productsDF2)

item_id,name,price
M_STAN_Q,Standard Queen Mattress,1045.0
M_STAN_K,Standard King Mattress,1195.0
M_STAN_T,Standard Twin Mattress,595.0
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0
M_PREM_T,Premium Twin Mattress,1095.0
M_PREM_K,Premium King Mattress,1995.0
P_DOWN_S,Standard Down Pillow,119.0
P_FOAM_S,Standard Foam Pillow,59.0


**CHECK YOUR WORK**

In [0]:
assert(userDefinedSchema.fieldNames() == ["item_id", "name", "price"])

In [0]:
from pyspark.sql import Row

expected1 = Row(item_id="M_STAN_Q", name="Standard Queen Mattress", price=1045.0)
result1 = productsDF2.first()

assert(expected1 == result1)

### 3. Read with DDL formatted string

In [0]:
# TODO
# price holds decimal values such as 1045.0, so it is declared as double
DDLSchema = "item_id string, name string, price double"

productsDF3 = (spark
           .read
           .option("sep", ",")
           .option("header", True)
           .schema(DDLSchema)
           .csv(productsCsvPath)
          )

display(productsDF3)

item_id,name,price
M_STAN_Q,Standard Queen Mattress,1045.0
M_STAN_K,Standard King Mattress,1195.0
M_STAN_T,Standard Twin Mattress,595.0
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0
M_PREM_T,Premium Twin Mattress,1095.0
M_PREM_K,Premium King Mattress,1995.0
P_DOWN_S,Standard Down Pillow,119.0
P_FOAM_S,Standard Foam Pillow,59.0


**CHECK YOUR WORK**

In [0]:
assert(productsDF3.count() == 12)

### 4. Write to Delta
Write `productsDF` to the filepath provided in the variable `productsOutputPath`

In [0]:
# TODO
productsOutputPath = workingDir + "/delta/products"
(productsDF
 .write
 .format("delta")
 .mode("overwrite")
 .save(productsOutputPath)
)

**CHECK YOUR WORK**

In [0]:
verify_files = dbutils.fs.ls(productsOutputPath)
verify_delta_format = False
verify_num_data_files = 0
for f in verify_files:
    if f.name == '_delta_log/':
        verify_delta_format = True
    elif f.name.endswith('.parquet'):
        verify_num_data_files += 1

assert verify_delta_format, "Data not written in Delta format"
assert verify_num_data_files > 0, "No data written"
del verify_files, verify_delta_format, verify_num_data_files

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>