
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# Set Up and Load Delta Tables

After extracting data from external data sources, load data into the Lakehouse to ensure that all of the benefits of the Databricks platform can be fully leveraged.

While different organizations may have varying policies for how data is initially loaded into Databricks, we typically recommend that early tables represent a mostly raw version of the data, and that validation and enrichment occur in later stages. This pattern ensures that even if data doesn't match expectations with regard to data types or column names, no data is dropped, so programmatic or manual intervention can still salvage data in a partially corrupted or invalid state.

This lesson will focus primarily on the pattern used to create most tables, **`CREATE TABLE _ AS SELECT`** (CTAS) statements.

## Learning Objectives
By the end of this lesson, you should be able to:
- Use CTAS statements to create Delta Lake tables
- Create new tables from existing views or tables
- Enrich loaded data with additional metadata

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

  - In the drop-down, select **More**.

  - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## Classroom Setup

Run the following cell to configure your working environment for this course. It will also set your default catalog to **dbacademy** and the schema to your specific schema name shown below using the `USE` statements.


```
USE CATALOG dbacademy;
USE SCHEMA dbacademy.<your unique schema name>;
```

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically reference the information needed to run the course.

In [0]:
%run ./Includes/Classroom-Setup-1

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


0,1
Course Catalog:,
Your Schema:,


## Querying Files
In the cell below, we run a query on a directory of Parquet files. These files are not currently registered as any kind of data object (such as a table), but we can run some kinds of queries against them exactly as if they were. The same approach works for many other file formats (CSV, JSON, etc.).

Most workflows will require users to access data from external cloud storage locations. 

In most companies, a workspace administrator will be responsible for configuring access to these storage locations. In this course, we are simply going to use data files that the `Classroom-Setup` script above installed in our workspace.


In [0]:
SELECT * 
FROM parquet.`/Volumes/dbacademy_ecommerce/v01/raw/sales-historical/` 
LIMIT 10;

order_id,email,transaction_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items
257436,amanda16@skinner.com,1592193956703494,2,2190.0,1,"List(List(null, M_PREM_T, Premium Twin Mattress, 2190.0, 1095.0, 2))"
257452,jefferyfisher@yahoo.com,1592201856856023,1,1195.0,1,"List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1))"
257595,davidcollier@brown-curry.com,1592213317602596,1,945.0,1,"List(List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))"
257847,espears@wilson.com,1592219850060620,2,2140.0,2,"List(List(null, M_STAN_K, Standard King Mattress, 1195.0, 1195.0, 1), List(null, M_STAN_F, Standard Full Mattress, 945.0, 945.0, 1))"
275392,tracy67@carrillo-steele.com,1592424836322591,1,850.5,1,"List(List(NEWBED10, M_STAN_F, Standard Full Mattress, 850.5, 945.0, 1))"
258151,rachael13@hotmail.com,1592225139292956,1,1045.0,1,"List(List(null, M_STAN_Q, Standard Queen Mattress, 1045.0, 1045.0, 1))"
282615,ztaylor73@yahoo.com,1592504254634073,1,940.5,1,"List(List(NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1))"
258158,ialvarado33@hotmail.com,1592225276980125,1,1795.0,1,"List(List(null, M_PREM_Q, Premium Queen Mattress, 1795.0, 1795.0, 1))"
281137,christinahayes@mooney-holland.com,1592496576817530,1,535.5,1,"List(List(NEWBED10, M_STAN_T, Standard Twin Mattress, 535.5, 595.0, 1))"
258387,jesuspalmer@stuart-chambers.com,1592227891252746,1,59.0,1,"List(List(null, P_FOAM_S, Standard Foam Pillow, 59.0, 59.0, 1))"


## Create Table as Select (CTAS)

We are going to create a table that contains historical sales data from a previous point-of-sale system. This data is in the form of Parquet files.

**`CREATE TABLE AS SELECT`** statements create and populate Delta tables in a single step, using data retrieved from an input query.

CTAS statements automatically infer schema information from query results and do **not** support manual schema declaration. 

This means that CTAS statements are useful for external data ingestion from sources with well-defined schema, such as Parquet files and tables.

In [0]:
CREATE OR REPLACE TABLE historical_sales_bronze 
USING DELTA AS
  SELECT * 
  FROM parquet.`/Volumes/dbacademy_ecommerce/v01/raw/sales-historical/`;


DESCRIBE historical_sales_bronze;

col_name,data_type,comment
order_id,bigint,
email,string,
transaction_timestamp,bigint,
total_item_quantity,bigint,
purchase_revenue_in_usd,double,
unique_items,bigint,
items,array<struct<...>>,


By running `DESCRIBE <table-name>`, we can see column names and data types. We see that the schema of this table looks correct.

## Extracting CSV
We also have data in the form of CSV files. The data files have a header row containing column names, and their fields are delimited with a "|" (pipe) character.

Querying these files directly with the `csv.` prefix gives us no way to declare the delimiter or the header row. As the cell below demonstrates, the header row is treated as data and the pipe-delimited fields are not split into columns.

In [0]:
CREATE OR REPLACE TEMP VIEW sales_unparsed AS
  SELECT * 
  FROM csv.`/Volumes/dbacademy_ecommerce/v01/raw/sales-csv/`;


SELECT * 
FROM sales_unparsed 
LIMIT 10;

_c0
order_id|email|transactions_timestamp|total_item_quantity|purchase_revenue_in_usd|unique_items|items
298592|sandovalaustin@holder.com|1592629288475307|1|850.5|1|[{'coupon': 'NEWBED10'
299024|msmith@monroe.com|1592636869915092|2|1092.6|2|[{'coupon': 'NEWBED10'
300048|robertstimothy@hotmail.com|1592649862529478|1|1075.5|1|[{'coupon': 'NEWBED10'
298711|lovejamie@yahoo.com|1592631406799948|1|850.5|1|[{'coupon': 'NEWBED10'
301760|jennifer7054@gmail.com|1592661071882666|1|940.5|1|[{'coupon': 'NEWBED10'
302809|ywhite@kane.org|1592665563660982|1|1075.5|1|[{'coupon': 'NEWBED10'
309136|karen61@hotmail.com|1592689638083947|1|1795.5|1|[{'coupon': 'NEWBED10'
303941|deborah18@conrad-gallagher.com|1592669885794924|1|850.5|1|[{'coupon': 'NEWBED10'
305920|khanedwin@gmail.com|1592676863608194|1|1075.5|1|[{'coupon': 'NEWBED10'
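
The effect above can be reproduced outside Databricks. The following is a minimal Python sketch that uses only the standard library `csv` module, not the Spark CSV reader. The sample text is abbreviated from the output above (fewer columns and a shortened `items` value, both chosen for illustration), and the helper names `parse_with_defaults` and `parse_with_options` are hypothetical. It contrasts splitting on the default comma with splitting on the pipe character and treating the first row as a header.

```python
import csv
import io

# Two lines abbreviated from the sample output; the items value is shortened.
RAW = (
    "order_id|email|transactions_timestamp|items\n"
    "298592|sandovalaustin@holder.com|1592629288475307|"
    "[{'coupon': 'NEWBED10', 'quantity': 1}]\n"
)


def parse_with_defaults(text):
    """Split on commas and treat every line, including the header, as data."""
    return list(csv.reader(io.StringIO(text)))


def parse_with_options(text):
    """Split on pipes and use the first line as the column names."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


default_rows = parse_with_defaults(RAW)
parsed_rows = parse_with_options(RAW)

# With defaults, the header is one unsplit field and the data row
# breaks at the first comma inside the items value.
print(default_rows[0])
print(default_rows[1][0])

# With the pipe delimiter and a header, fields can be read by column name.
print(parsed_rows[0]["email"])
```

Declaring the delimiter and header is what turns a single unparsed text column into named fields, which is the role the options play in the next section.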


## The `read_files()` Table-Valued Function

The code in the next cell creates a table using CTAS. The `read_files()` table-valued function (TVF) allows us to read a variety of different file formats. The first parameter is a path to the data; here it points to the sample CSV files that the `Classroom-Setup` script (at the top of this notebook) made available.

We are using these options:

1. `format => "csv"` -- Our data files are in the `CSV` format
1. `sep => "|"` -- Our data fields are separated by the | (pipe) character
1. `header => true` -- The first row of data should be used as the column names
1. `mode => "FAILFAST"` -- This will cause the statement to throw an error and abort the read if there is any malformed data

In this case, we are moving existing `CSV` data, but we could just as easily read other file formats by using different options.

A `_rescued_data` column is provided by default to rescue any data that doesn’t match the schema. 

For more information check out the [read_files table-valued function](https://docs.databricks.com/en/sql/language-manual/functions/read_files.html).
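
To make the difference between failing fast and rescuing data concrete, here is a minimal Python sketch. It illustrates the idea only and is not how `read_files()` is implemented: the `SCHEMA` dictionary, the `parse_record` helper, and the sample records are all hypothetical, and `PERMISSIVE` is used here simply as the name for the non-failing behavior.

```python
import json

# Hypothetical target schema for illustration: column name -> Python type.
SCHEMA = {"order_id": int, "purchase_revenue_in_usd": float}


def parse_record(record, mode="PERMISSIVE"):
    """Cast each field to its schema type.

    FAILFAST: raise on the first value that cannot be cast.
    PERMISSIVE: store None for the value and keep the raw text
    in a _rescued_data JSON string.
    """
    parsed, rescued = {}, {}
    for column, caster in SCHEMA.items():
        raw = record.get(column)
        try:
            parsed[column] = caster(raw)
        except (TypeError, ValueError):
            if mode == "FAILFAST":
                raise ValueError(f"malformed value for {column}: {raw!r}")
            parsed[column] = None
            rescued[column] = raw
    parsed["_rescued_data"] = json.dumps(rescued) if rescued else None
    return parsed


good = {"order_id": "298592", "purchase_revenue_in_usd": "850.5"}
bad = {"order_id": "not-a-number", "purchase_revenue_in_usd": "850.5"}

print(parse_record(good))  # all values cast, nothing rescued
print(parse_record(bad))   # order_id is None, raw text kept in _rescued_data
```

Calling `parse_record(bad, mode="FAILFAST")` raises a `ValueError` instead of returning a row, which mirrors the trade-off: fail-fast surfaces bad input immediately, while rescuing keeps the load running and preserves the raw values for later inspection.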

In [0]:
DROP TABLE IF EXISTS sales_bronze;

CREATE OR REPLACE TABLE sales_bronze 
USING DELTA AS
      SELECT * 
      FROM read_files("/Volumes/dbacademy_ecommerce/v01/raw/sales-csv/",
            format => "csv",
            sep => "|",
            header => true,
            mode => "FAILFAST");


SELECT * 
FROM sales_bronze 
LIMIT 10;

order_id,email,transactions_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items,_rescued_data
298592,sandovalaustin@holder.com,1592629288475307,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]",
299024,msmith@monroe.com,1592636869915092,2,1092.6,2,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_T', 'item_name': 'Premium Twin Mattress', 'item_revenue_in_usd': 985.5, 'price_in_usd': 1095.0, 'quantity': 1}, {'coupon': 'NEWBED10', 'item_id': 'P_DOWN_S', 'item_name': 'Standard Down Pillow', 'item_revenue_in_usd': 107.10000000000001, 'price_in_usd': 119.0, 'quantity': 1}]",
300048,robertstimothy@hotmail.com,1592649862529478,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]",
298711,lovejamie@yahoo.com,1592631406799948,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]",
301760,jennifer7054@gmail.com,1592661071882666,1,940.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_Q', 'item_name': 'Standard Queen Mattress', 'item_revenue_in_usd': 940.5, 'price_in_usd': 1045.0, 'quantity': 1}]",
302809,ywhite@kane.org,1592665563660982,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]",
309136,karen61@hotmail.com,1592689638083947,1,1795.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_K', 'item_name': 'Premium King Mattress', 'item_revenue_in_usd': 1795.5, 'price_in_usd': 1995.0, 'quantity': 1}]",
303941,deborah18@conrad-gallagher.com,1592669885794924,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]",
305920,khanedwin@gmail.com,1592676863608194,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]",
298795,samantha4354@hotmail.com,1592632916516773,1,985.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_T', 'item_name': 'Premium Twin Mattress', 'item_revenue_in_usd': 985.5, 'price_in_usd': 1095.0, 'quantity': 1}]",


In the next cell, run `DESCRIBE EXTENDED` on our new table and see that:

1. The column names and data types were inferred correctly
1. The table was created in the `dbacademy` catalog and your course schema, not in `hive_metastore`. Both were set up for us by the `Classroom-Setup` script
1. The table is MANAGED, and we can see a path to the data in the metastore's default location
1. The table is a Delta table
1. You own the table. This is true of everything you create, unless you change the owner

In [0]:
DESCRIBE EXTENDED sales_bronze;

col_name,data_type,comment
order_id,int,
email,string,
transactions_timestamp,bigint,
total_item_quantity,int,
purchase_revenue_in_usd,double,
unique_items,int,
items,string,
_rescued_data,string,
,,
# Delta Statistics Columns,,


## Catalogs, Schemas, and Tables on Databricks
We've created two tables so far: `historical_sales_bronze` and `sales_bronze`. However, we have not specified the catalog or schema (database) in which these tables should live. The `Classroom-Setup` script at the top of the notebook ran `USE` statements that set the default catalog to `dbacademy` and the default schema to your unique schema, so any table we create lives there.

Running the next cell shows information about the catalog that the setup script above configured for you. Because the course catalog is named `dbacademy`, we can pass that name directly to `DESCRIBE CATALOG <catalog_name>`.

In [0]:
DESCRIBE CATALOG dbacademy;

info_name,info_value
Catalog Name,dbacademy
Comment,
Owner,metastore_admins
Catalog Type,Regular


Run the code below to see information about the schema (database) that was created for you. In the output below, the schema name is in the row called "Namespace Name." You can see that the schema was auto-created when the catalog was created.

Note: The `DA` object is only used in Databricks Academy courses and is not available outside of these courses.

The [IDENTIFIER clause](https://docs.databricks.com/en/sql/language-manual/sql-ref-names-identifier-clause.html) interprets a constant string as a:
- table or view name
- function name
- column name
- field name
- schema name

In [0]:
DESCRIBE SCHEMA IDENTIFIER(DA.schema_name);

database_description_item,database_description_value
Catalog Name,dbacademy
Namespace Name,labuser9051024_1737999152
Comment,
Location,
Owner,9556a37f-7dc0-4b5f-849c-babbde9b34af


## Managed and External Tables
Databricks supports managed tables, whose data files are written to the managed storage location associated with the metastore, catalog, or schema, and external tables, which are registered with the metastore while their data stays where it already lives in object storage. With external tables, data remains in its original location, but you can query it from within Databricks the same way as a managed table. Once a table is created, queries against it look the same whether it is managed or external. So far, all the tables we have created are managed tables.

We will **not** be creating external tables in this course. You can learn about creating external tables [here](https://docs.databricks.com/en/sql/language-manual/sql-ref-external-tables.html).

## Load Incrementally

**`COPY INTO`** provides SQL engineers an idempotent option to incrementally ingest data from external systems.

Note that this operation does have some expectations:
- Data schema should be consistent
- Duplicate records should be excluded or handled downstream

This operation is potentially much cheaper than full table scans for data that grows predictably.

We want to capture new data but not re-ingest files that have already been read. We can use `COPY INTO` to perform this action. 

The first step is to create an empty table. We can then use COPY INTO to infer the schema of our existing data and copy data from new files that were added since the last time we ran `COPY INTO`.


In [0]:
DROP TABLE IF EXISTS users_bronze;

CREATE TABLE users_bronze 
USING DELTA;

COPY INTO loads data from data files into a Delta table. This is a retriable and idempotent operation, meaning that files in the source location that have already been loaded are skipped.

The cell below demonstrates how to use `COPY INTO` with a Parquet source, specifying:
1. The path to the data
1. The `FILEFORMAT` of the data, in this case Parquet
1. `COPY_OPTIONS` -- There are a number of key-value pairs that can be used. We are specifying that we want to merge the schema of the data

In [0]:
COPY INTO users_bronze
  FROM '/Volumes/dbacademy_ecommerce/v01/raw/users-30m/'
  FILEFORMAT = parquet
  COPY_OPTIONS ('mergeSchema' = 'true');

num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
983,983,0


## COPY INTO is Idempotent
COPY INTO keeps track of the files it has ingested previously. We can run it again, and no additional data is ingested because the files in the source directory haven't changed. Let's run the `COPY INTO` command again to show this. 

The total row count (983) equals the `num_inserted_rows` value from the first run, because the second run copied no new data into the table.

In [0]:
COPY INTO users_bronze
  FROM '/Volumes/dbacademy_ecommerce/v01/raw/users-30m/'
  FILEFORMAT = parquet
  COPY_OPTIONS ('mergeSchema' = 'true');

num_affected_rows,num_inserted_rows,num_skipped_corrupt_files
0,0,0


In [0]:
SELECT count(*) 
FROM users_bronze;

count(*)
983
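
The idempotent behavior demonstrated above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how `COPY INTO` is implemented: the `IdempotentLoader` class, its `copy_into` method, and the file names are hypothetical, and files are tracked by name only.

```python
class IdempotentLoader:
    """Toy loader that remembers which source files it has already ingested."""

    def __init__(self):
        self.rows = []
        self.loaded_files = set()

    def copy_into(self, source_files):
        """Load rows only from files not seen before; return rows inserted."""
        inserted = 0
        for name, file_rows in source_files.items():
            if name in self.loaded_files:
                continue  # already ingested on an earlier run
            self.rows.extend(file_rows)
            self.loaded_files.add(name)
            inserted += len(file_rows)
        return inserted


# Hypothetical source directory: file name -> rows in that file.
source = {"part-0000.parquet": [{"user_id": "UA1"}, {"user_id": "UA2"}]}

loader = IdempotentLoader()
first_run = loader.copy_into(source)   # loads both rows
second_run = loader.copy_into(source)  # nothing new, so nothing is loaded

# A new file arrives; only its rows are loaded on the next run.
source["part-0001.parquet"] = [{"user_id": "UA3"}]
third_run = loader.copy_into(source)

print(first_run, second_run, third_run, len(loader.rows))  # prints: 2 0 1 3
```

Because the loader skips files it has already seen, rerunning it is safe, and the table's total row count only grows when new files appear in the source location.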



## Creating External Tables

While Spark will extract some self-describing data sources efficiently using default settings, many formats will require declaration of schema or other options.

External tables are tables whose data is stored outside of the managed storage location specified for the metastore, catalog, or schema. Use external tables only when you require direct access to the data outside of Databricks clusters or Databricks SQL warehouses.

In order to provide access to an external storage location, a user with the necessary privileges must follow the instructions found [here](https://docs.databricks.com/en/sql/language-manual/sql-ref-external-locations.html). Once the external location is properly configured, external tables can be created with code like this:

<strong><code>
DROP TABLE IF EXISTS sales_csv;<br />
CREATE TABLE sales_csv<br />
  (order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)<br />
USING CSV<br />
OPTIONS (<br />
  header = "true",<br />
  delimiter = "|"<br />
)<br />
LOCATION "<path-to-external-location>"<br />
</code></strong>

Note the use of the **`LOCATION`** keyword, followed by a path to the pre-configured external location. When you run **`DROP TABLE`** on an external table, Unity Catalog does not delete the underlying data.

Also note that options are passed with keys as unquoted text and values in quotes. Spark supports many <a href="https://docs.databricks.com/data/data-sources/index.html" target="_blank">data sources</a> with custom options, and additional systems may have unofficial support through external <a href="https://docs.databricks.com/libraries/index.html" target="_blank">libraries</a>. 

**NOTE**: Depending on your workspace settings, you may need administrator assistance to load libraries and configure the requisite security settings for some data sources.


## Limits of Tables with External Data Sources

By loading data with CTAS and `COPY INTO` as we have so far, we convert our source files into the Delta format. This allows us to take advantage of the performance guarantees associated with Delta Lake and the Databricks Data Intelligence Platform.

When we define tables or queries directly against external data sources, we **cannot** expect the performance guarantees associated with Delta Lake and the Data Intelligence Platform.

For example, while Delta Lake tables guarantee that you always query the most recent version of your source data, tables registered against other data sources may represent older cached versions.

## Built-In Functions

Databricks has a vast [number of built-in functions](https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin.html) you can use in your code.

We are going to create a table for user data generated by the previous point-of-sale system, but we need to make some changes. 

The `user_first_touch_timestamp` column stores microseconds since the Unix epoch as a number rather than as a timestamp. To convert it, we divide the value by 1e6 (1 million) to get seconds, use `CAST` to cast the result to a [TIMESTAMP](https://docs.databricks.com/en/sql/language-manual/data-types/timestamp-type.html), and then `CAST` that to a [DATE](https://docs.databricks.com/en/sql/language-manual/data-types/date-type.html).

The syntax for `CAST` is `CAST(column AS data_type)`. To use `CAST` with `COPY INTO`, we replace the source path after the word `FROM` (in the `COPY INTO`) with a parenthesized `SELECT` clause (make sure you include the parentheses).

Our **`SELECT`** clause also leverages two built-in Spark SQL features useful for file ingestion:
* **`current_timestamp()`** records the timestamp when the logic is executed
* **`_metadata.file_name`** records the source data file for each record in the table
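
As a sanity check on the timestamp arithmetic, the same conversion can be sketched in plain Python. This is an illustration, not the Spark implementation: the `micros_to_date` helper is hypothetical, and it assumes UTC, whereas a SQL cast to `TIMESTAMP` and `DATE` depends on the session time zone. The sample value is the first `user_first_touch_timestamp` in the output of the cell below.

```python
from datetime import datetime, timezone


def micros_to_date(micros):
    """Convert microseconds since the Unix epoch to a UTC calendar date."""
    seconds = micros / 1e6  # microseconds -> seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


sample = 1592190121523305
print(micros_to_date(sample))  # prints 2020-06-15
```

Skipping the division would treat the value as seconds and produce a date far outside any plausible range, which is why the cast chain divides by 1e6 first.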


In [0]:
DROP TABLE IF EXISTS users_bronze;

CREATE TABLE users_bronze;
COPY INTO users_bronze FROM
  (SELECT *, 
    cast(cast(user_first_touch_timestamp/1e6 AS TIMESTAMP) AS DATE) first_touch_date, 
    current_timestamp() updated,
    _metadata.file_name source_file
  FROM '/Volumes/dbacademy_ecommerce/v01/raw/users-historical/')
  FILEFORMAT = PARQUET
  COPY_OPTIONS ('mergeSchema' = 'true');


SELECT * 
FROM users_bronze LIMIT 10;

user_id,user_first_touch_timestamp,email,first_touch_date,updated,source_file
UA000000102357395,1592190121523305,jeremyfarrell@hart.net,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet
UA000000102357489,1592192459520769,,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet
UA000000102357626,1592194772447739,bergjesse@yahoo.com,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet
UA000000102357672,1592195514566890,,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet
UA000000102357678,1592195595064595,,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet
UA000000102357776,1592196622138468,,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet
UA000000102357956,1592198144189925,,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet
UA000000102358011,1592198574912871,,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet
UA000000102358095,1592199051437790,,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet
UA000000102358668,1592201725118996,,2020-06-15,2025-01-27T18:17:22.107Z,part-00000-tid-531959640415905750-948b4f2d-2d35-46e3-97eb-e6d85d2bf872-7571-1-c000.snappy.parquet



&copy; 2025 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>