
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Reading and Writing to Azure SQL Data Warehouse
**Technical Accomplishments:**
- Access an Azure SQL Data Warehouse using the SQL Data Warehouse connector

**Requirements:**
- A database master key for the Azure SQL Data Warehouse

You will create the data warehouse and its database master key in the steps below.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "../Includes/Classroom-Setup"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Azure SQL Data Warehouse
Azure SQL Data Warehouse leverages massively parallel processing (MPP) to quickly run complex queries across petabytes of data.

Import big data into SQL Data Warehouse with simple PolyBase T-SQL queries, and then use MPP to run high-performance analytics.

As you integrate and analyze, the data warehouse will become the single version of truth your business can count on for insights.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) SQL Data Warehouse Connector
- The connector uses Azure Blob Storage as an intermediary between Azure Databricks and SQL Data Warehouse
- In Azure Databricks, it triggers Spark jobs that read and write data in Blob Storage
- In SQL Data Warehouse, it triggers data loading and unloading operations, which are performed by **PolyBase**

**Note:** The SQL DW connector is more suited to ETL than to interactive queries.  
For interactive and ad-hoc queries, data should be extracted into a Databricks Delta table.

![](https://files.training.databricks.com/images/adbcore/AAHRBWKzrNVMUpfjecWUpfRb9p8pVZl7fsMB.png)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Types of Connections in SQL Data Warehouse

### **Spark Driver to SQL DW**
The Spark driver connects to SQL Data Warehouse via JDBC using a username and password.

### **Spark Driver and Executors to Azure Blob Storage**
Spark uses the **Azure Blob Storage connector** bundled in Databricks Runtime to connect to the Blob Storage container.
  - Requires the **`wasbs`** URI scheme to address the container
  - Requires the **storage account access key** to set up the connection
    - The key is set in the notebook's session configuration, which doesn't affect other notebooks attached to the same cluster
    - The key is set with **`spark.conf.set`**, where **`spark`** is the SparkSession object provided in the notebook

### **SQL DW to Azure Blob Storage**
The SQL DW connector forwards the access key from the notebook session configuration to the SQL Data Warehouse instance over JDBC.
  - Requires **`forwardSparkAzureStorageCredentials`** set to **`true`**
  - Represents access key with a temporary <a href="https://docs.microsoft.com/en-us/sql/t-sql/statements/create-database-scoped-credential-transact-sql?view=sql-server-2017" target="_blank">database scoped credential</a> in the SQL Data Warehouse instance
  - Creates the database scoped credential before asking SQL DW to load or unload data, and deletes it once the load or unload has finished

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Configuration

### Create Azure Blob Storage
Follow these steps to <a href="https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=azure-portal#regenerate-storage-access-keys" target="_blank">create an Azure Storage Account</a> and Container.  
The SQL DW connector will use a <a href="https://docs.microsoft.com/en-us/rest/api/storageservices/authorize-with-shared-key" target="_blank">Shared Key</a> for authorization.

As you work through the following steps, record the **Storage Account Name**, **Container Name**, and **Access Key** in the cell below:
0. Access the Azure Portal > Create a new resource > Storage account
0. Specify the correct *Resource Group* and *Region*, and choose a globally unique **Storage Account Name** (3 to 24 lowercase letters and digits)
0. Access the new Storage account > Access Blobs
0. Create a New Container, choosing a **Container Name** that is unique within the storage account (3 to 63 lowercase letters, digits, and hyphens)
0. Retrieve the primary **Access Key** for the new Storage Account
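
Azure rejects storage account and container names that break its naming rules, so it can help to check the values before pasting them into the cell below. The following is a minimal sketch of such a check; the helper names `is_valid_storage_account` and `is_valid_container` are hypothetical and not part of any Azure or Databricks library.

```python
import re

# Naming rules as documented for Azure Storage; the helper names are hypothetical.
_ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
_CONTAINER_RE = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_storage_account(name):
    """3 to 24 characters, lowercase letters and digits only."""
    return _ACCOUNT_RE.fullmatch(name) is not None


def is_valid_container(name):
    """3 to 63 characters: lowercase letters, digits, and single hyphens,
    where every hyphen sits between two letters or digits."""
    return 3 <= len(name) <= 63 and _CONTAINER_RE.fullmatch(name) is not None


# Illustrative names only; substitute your own values.
print(is_valid_storage_account("mystorage01"))
print(is_valid_container("sqldw-temp"))
print(is_valid_container("bad--name"))
```

Uniqueness is not covered here: only the Azure Portal can confirm that a storage account name is not already taken.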

In [0]:
# TODO
storageAccount = ""
containerName = ""
accessKey = ""

spark.conf.set("fs.azure.account.key.{}.blob.core.windows.net".format(storageAccount), accessKey)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Configuration

### Create Azure SQL Data Warehouse
Follow these steps to <a href="https://docs.microsoft.com/en-us/azure/sql-data-warehouse/create-data-warehouse-portal" target="_blank">create an Azure SQL Data Warehouse</a>.

0. Access the Azure Portal > Create a new resource > SQL Data Warehouse
0. Specify the following attributes for the SQL Data Warehouse:
   - Use any string for the **Data warehouse name**
   - Select an existing or create a new SQL Server
   - Under the **Additional Settings** tab, select **Sample** for the **Use existing data** option
0. Access the new SQL Data Warehouse
0. Select **Query Editor (preview)** under **Common Tasks** in the sidebar and enter the proper credentials
0. Run these two queries:
   - Create a Master Key in the SQL DW. The SQL DW connector needs it in order to create the temporary database scoped credential that holds the storage access key
   
     **`CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'CORRECT-horse-battery-staple';`**

   - Use a CTAS to create a staging table for the Customer Table. This query will create an empty table with the same schema as the Customer Table.
   
     **`CREATE TABLE dbo.DimCustomerStaging`**  
     **`WITH`**  
     **`( DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX )`**  
     **`AS`**  
     **`SELECT  *`**  
     **`FROM dbo.DimCustomer`**  
     **`WHERE 1 = 2`**  
     **`;`**

0. Access Connection Strings
0. Select JDBC and copy the **JDBC URI**
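
The `WHERE 1 = 2` predicate in the CTAS query above never matches a row, so only the column definitions are copied into the staging table. The sketch below reproduces that behaviour with Python's built-in `sqlite3` module rather than SQL Data Warehouse, so the `DISTRIBUTION` and index options are left out; the two-column table and its single row are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (an assumption
# made so the example runs anywhere; SQL DW syntax has extra WITH options).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DimCustomer (CustomerKey INTEGER, CustomerAlternateKey TEXT)")
conn.execute("INSERT INTO DimCustomer VALUES (1, 'AW00011000')")  # illustrative row

# WHERE 1 = 2 is always false, so the new table gets the columns but no rows.
conn.execute("CREATE TABLE DimCustomerStaging AS SELECT * FROM DimCustomer WHERE 1 = 2")

staging_columns = [row[1] for row in conn.execute("PRAGMA table_info(DimCustomerStaging)")]
staging_rows = conn.execute("SELECT COUNT(*) FROM DimCustomerStaging").fetchone()[0]

print(staging_columns)  # column names copied from DimCustomer
print(staging_rows)     # number of rows in the staging table
```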

In [0]:
# TODO
tableName = "dbo.DimCustomer"
jdbcURI = ""
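
Before using the copied string, it can be worth confirming that it has the expected shape and that the credentials are filled in, since the string copied from the portal may contain a password placeholder. The following is a rough sketch of such a check; `parse_sqlserver_jdbc_uri` is a hypothetical helper, and the server, database, and user names are placeholders.

```python
def parse_sqlserver_jdbc_uri(uri):
    """Split a SQL Server JDBC URI into its host part and a dict of properties."""
    prefix = "jdbc:sqlserver://"
    if not uri.startswith(prefix):
        raise ValueError("expected a URI starting with " + prefix)
    host, _, rest = uri[len(prefix):].partition(";")
    props = {}
    for part in rest.split(";"):
        if part:  # skip the empty piece left by a trailing semicolon
            key, _, value = part.partition("=")
            props[key.strip().lower()] = value.strip()
    return host, props


# Placeholder values only; substitute the string copied from the portal.
example_uri = (
    "jdbc:sqlserver://example-server.database.windows.net:1433;"
    "database=example-dw;user=example-user@example-server;"
    "password=example-password;encrypt=true;loginTimeout=30;"
)

host, props = parse_sqlserver_jdbc_uri(example_uri)
missing = [key for key in ("database", "user", "password") if not props.get(key)]

print(host)
print(missing)  # names of required properties that are absent or empty
```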

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Read from the Customer Table

Use the SQL DW Connector to read data from the Customer Table.

Use the resulting DataFrame to define a temporary view that can be queried.

Note the following options in the DataFrameReader in the cell below:
* **`url`** specifies the JDBC connection to the SQL Data Warehouse
* **`tempDir`** specifies the **`wasbs`** URI of the caching directory on the Azure Blob Storage container
* **`forwardSparkAzureStorageCredentials`** is set to **`true`** to ensure that the Azure storage account access keys are forwarded from the notebook's session configuration to the SQL Data Warehouse

In [0]:
cacheDir = "wasbs://{}@{}.blob.core.windows.net/cacheDir".format(containerName, storageAccount)

customerDF = (spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", jdbcURI)
  .option("tempDir", cacheDir)
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", tableName)
  .load())

customerDF.createOrReplaceTempView("customer_data")

Use SQL queries to count the number of rows in the Customer table and to display table metadata.
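
One possible way to run these two queries from a Python cell is sketched below. It assumes the `customer_data` view registered above; the variable names are illustrative, and the guard simply skips execution when the code runs outside a notebook where `spark` and `display` are defined.

```python
view_name = "customer_data"  # the temporary view registered above

# Row count and column metadata for the view.
count_query = "SELECT COUNT(*) AS row_count FROM {}".format(view_name)
describe_query = "DESCRIBE {}".format(view_name)

# `spark` and `display` are provided by the Databricks notebook, so the
# queries are only executed when a SparkSession is available.
if "spark" in globals():
    display(spark.sql(count_query))
    display(spark.sql(describe_query))
```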

Note that **`CustomerKey`** and **`CustomerAlternateKey`** use a very similar naming convention.

When merging many new customers into this table, we may have issues with uniqueness in the **`CustomerKey`**. 

Let's redefine **`CustomerAlternateKey`** for stronger uniqueness using a <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier" target="_blank">UUID</a>. To do this, we will define a UDF and use it to transform the **`CustomerAlternateKey`** column.

In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import uuid

# uuid4 returns a new value on every call, so mark the UDF as nondeterministic
uuidUdf = udf(lambda: str(uuid.uuid4()), StringType()).asNondeterministic()
customerUpdatedDF = customerDF.withColumn("CustomerAlternateKey", uuidUdf())
display(customerUpdatedDF)
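
To see what the UDF produces for each row, the same `uuid.uuid4()` call can be tried in plain Python without Spark. This is only a small illustration; the sample size of 1000 is arbitrary.

```python
import uuid

# Generate identifiers the same way the body of the UDF does.
keys = [str(uuid.uuid4()) for _ in range(1000)]

print(keys[0])                      # a hyphenated string; the value differs on every run
print(len(set(keys)) == len(keys))  # whether all generated keys are distinct
print(uuid.UUID(keys[0]).version)   # UUID version of the generated key
```

Because every call returns a new value, re-evaluating the DataFrame would assign different keys to the same rows, which is worth keeping in mind before the result is written out.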

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Use the PolyBase Connector to Write to the Staging Table

Use the SQL DW Connector to write the updated customer table to a staging table.

It is best practice to update the SQL Data Warehouse via a staging table.

Note the following options in the DataFrameWriter in the cell below:
* **`url`** specifies the JDBC connection to the SQL Data Warehouse
* **`tempDir`** specifies the **`wasbs`** URI of the caching directory on the Azure Blob Storage container
* **`forwardSparkAzureStorageCredentials`** is set to **`true`** to ensure that the Azure storage account access keys are forwarded from the notebook's session configuration to the SQL Data Warehouse

These options are the same as those in the DataFrameReader above. The writer also sets the save mode to **`overwrite`**, which replaces the existing staging table.

In [0]:
(customerUpdatedDF.write
  .format("com.databricks.spark.sqldw")
  .mode("overwrite")
  .option("url", jdbcURI)
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", tableName + "Staging")
  .option("tempDir", cacheDir)
  .save())
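
Because the reader and the writer take the same connector options, they could be collected once and reused. The sketch below shows one way to do that; `sqldw_options` is a hypothetical helper, and the URI and storage names are placeholders rather than real resources.

```python
def sqldw_options(jdbc_uri, temp_dir, table):
    """Collect the connector options shared by the reader and the writer."""
    return {
        "url": jdbc_uri,
        "tempDir": temp_dir,
        "forwardSparkAzureStorageCredentials": "true",
        "dbTable": table,
    }


# Placeholder values; in the notebook these would be jdbcURI, cacheDir,
# and tableName + "Staging".
staging_options = sqldw_options(
    "jdbc:sqlserver://example-server.database.windows.net:1433;database=example-dw",
    "wasbs://example-container@exampleaccount.blob.core.windows.net/cacheDir",
    "dbo.DimCustomer" + "Staging",
)
print(staging_options["dbTable"])

# In the notebook, the dictionary could be unpacked into either side, e.g.
#   spark.read.format("com.databricks.spark.sqldw").options(**staging_options).load()
#   customerUpdatedDF.write.format("com.databricks.spark.sqldw") \
#       .mode("overwrite").options(**staging_options).save()
```

Keeping the options in one place avoids the reader and writer drifting apart, for example in the spelling of an option name.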

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Read From the New Staging Table
Use the SQL DW Connector to read the new table we just wrote.

Use the resulting DataFrame to define a temporary view that can be queried.

In [0]:
customerTempDF = (spark.read
  .format("com.databricks.spark.sqldw")
  .option("url", jdbcURI)
  .option("tempDir", cacheDir)
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", tableName + "Staging")
  .load())

customerTempDF.createOrReplaceTempView("customer_temp_data")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>