
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Databases and Tables on Databricks
In this demonstration, you will create and explore databases and tables.

## Learning Objectives
By the end of this lesson, you should be able to:
* Use Spark SQL DDL to define databases and tables
* Describe how the **`LOCATION`** keyword impacts the default storage directory



**Resources**
* <a href="https://docs.databricks.com/user-guide/tables.html" target="_blank">Databases and Tables - Databricks Docs</a>
* <a href="https://docs.databricks.com/user-guide/tables.html#managed-and-unmanaged-tables" target="_blank">Managed and Unmanaged Tables</a>
* <a href="https://docs.databricks.com/user-guide/tables.html#create-a-table-using-the-ui" target="_blank">Creating a Table with the UI</a>
* <a href="https://docs.databricks.com/user-guide/tables.html#create-a-local-table" target="_blank">Create a Local Table</a>
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables" target="_blank">Saving to Persistent Tables</a>

## Lesson Setup
The following script clears out previous runs of this demo and configures some Hive variables that will be used in our SQL queries.

In [0]:
%run ../Includes/Classroom-Setup-3.1

## Using Hive Variables

While not a pattern that is generally recommended in Spark SQL, this notebook will use some Hive variables to substitute in string values derived from the account email of the current user.

The following cell demonstrates this pattern.

In [0]:
%sql
SELECT "${da.db_name}" AS db_name,
       "${da.paths.working_dir}" AS working_dir

db_name,working_dir
dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_1,dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.1


Because you may be working in a shared workspace, this course uses variables derived from your username so the databases don't conflict with other users. Again, consider this use of Hive variables a hack for our lesson environment rather than a good practice for development.
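
To make the idea concrete, here is a minimal sketch of how a per-user database name could be derived from an account email. It is an assumption for illustration only: `make_db_name` is a hypothetical helper and the email is made up. The actual logic lives in the classroom setup script, which is not shown here.

```python
import re


def make_db_name(username: str, course: str, lesson: str) -> str:
    """Build a per-user database name (hypothetical helper, not the course's setup code)."""

    def clean(value: str) -> str:
        # Replace anything that is not a letter, digit, or underscore with "_"
        # so the result can be used as an unquoted SQL identifier.
        return re.sub(r"[^a-zA-Z0-9_]", "_", value).lower()

    return f"dbacademy_{clean(username)}_{clean(course)}_{clean(lesson)}"


# Made-up account email, course code, and lesson number.
db_name = make_db_name("first.last@example.com", "dewd", "3.1")
print(db_name)  # dbacademy_first_last_example_com_dewd_3_1
```

Because the name is a pure function of the username, two users in the same workspace get different databases, while the same user gets the same database on every run.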

# Databases
Let's start by creating two databases:
- One with no **`LOCATION`** specified
- One with **`LOCATION`** specified

In [0]:
%sql
CREATE DATABASE IF NOT EXISTS ${da.db_name}_default_location;
CREATE DATABASE IF NOT EXISTS ${da.db_name}_custom_location LOCATION '${da.paths.working_dir}/_custom_location.db';

Note that the first database is stored in the default location under **`dbfs:/user/hive/warehouse/`**, and that the database directory is the database name with the **`.db`** extension.

In [0]:
%sql
DESCRIBE DATABASE EXTENDED ${da.db_name}_default_location;

database_description_item,database_description_value
Namespace Name,dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_1_default_location
Comment,
Location,dbfs:/user/hive/warehouse/dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_1_default_location.db
Owner,root
Properties,


Note that the location of the second database is in the directory specified after the **`LOCATION`** keyword.

In [0]:
%sql
DESCRIBE DATABASE EXTENDED ${da.db_name}_custom_location;

database_description_item,database_description_value
Namespace Name,dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_1_custom_location
Comment,
Location,dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.1/_custom_location.db
Owner,root
Properties,
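
The two `DESCRIBE DATABASE EXTENDED` results above follow a simple rule, which can be sketched in plain Python. This is a minimal illustration, assuming the default warehouse root shown in this workspace's output. `database_location` is a hypothetical helper, not a Spark or Databricks API, and the database names and paths in the example are made up.

```python
from typing import Optional

# Default warehouse root, as seen in the DESCRIBE output for this workspace.
HIVE_WAREHOUSE = "dbfs:/user/hive/warehouse"


def database_location(db_name: str, location: Optional[str] = None) -> str:
    """Return the directory a database would use (illustrative helper only)."""
    if location is not None:
        # An explicit LOCATION is used as given.
        return location.rstrip("/")
    # No LOCATION: <warehouse root>/<database name>.db
    return f"{HIVE_WAREHOUSE}/{db_name}.db"


# Made-up names for illustration.
print(database_location("demo_default_location"))
print(database_location("demo_custom_location", "dbfs:/tmp/work/_custom_location.db"))
```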


### We will create a table in the database with the default location and insert data.

Note that the schema must be provided because there is no data from which to infer the schema.

In [0]:
%sql
USE ${da.db_name}_default_location;

CREATE OR REPLACE TABLE managed_table_in_db_with_default_location (width INT, length INT, height INT);
INSERT INTO managed_table_in_db_with_default_location 
VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_default_location;

width,length,height
3,2,1


We can look at the extended table description to find the location (you'll need to scroll down in the results).

In [0]:
%sql
DESCRIBE EXTENDED managed_table_in_db_with_default_location;

col_name,data_type,comment
width,int,
length,int,
height,int,
,,
# Partitioning,,
Not partitioned,,
,,
# Detailed Table Information,,
Catalog,spark_catalog,
Database,dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_1_default_location,
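
Rather than scrolling, the location can be picked out of the description rows programmatically. The sketch below is only an illustration: `find_detail` is a hypothetical helper, and the sample rows are hand-written stand-ins shaped like `(col_name, data_type, comment)`, with a made-up path.

```python
from typing import Iterable, Optional, Sequence


def find_detail(rows: Iterable[Sequence[Optional[str]]], key: str) -> Optional[str]:
    """Return the value for `key` from the detailed-information part of the rows."""
    in_details = False
    for row in rows:
        name = (row[0] or "").strip()
        if name == "# Detailed Table Information":
            # Only rows after this marker describe the table itself;
            # rows before it are the table's columns.
            in_details = True
            continue
        if in_details and name == key:
            return row[1]
    return None


# Hand-written sample rows; the Location value is hypothetical.
sample_rows = [
    ("width", "int", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Catalog", "spark_catalog", ""),
    ("Location", "dbfs:/example/path/my_table", ""),
]
print(find_detail(sample_rows, "Location"))
```

In the notebook, the rows could come from `spark.sql("DESCRIBE EXTENDED managed_table_in_db_with_default_location").collect()`, since each returned `Row` supports positional indexing.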


### By default, managed tables in a database without the location specified will be created in the **`dbfs:/user/hive/warehouse/<database_name>.db/`** directory.

We can see that, as expected, the data and metadata for our Delta Table are stored in that location.

In [0]:
%python 
hive_root = "dbfs:/user/hive/warehouse"
db_name = f"{DA.db_name}_default_location.db"
table_name = "managed_table_in_db_with_default_location"

tbl_location = f"{hive_root}/{db_name}/{table_name}"
print(tbl_location)

files = dbutils.fs.ls(tbl_location)
display(files)

path,name,size,modificationTime
dbfs:/user/hive/warehouse/dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_1_default_location.db/managed_table_in_db_with_default_location/_delta_log/,_delta_log/,0,1658999609000
dbfs:/user/hive/warehouse/dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_1_default_location.db/managed_table_in_db_with_default_location/part-00000-2dd9ac28-2fcb-40f7-a71f-562860cdbf04-c000.snappy.parquet,part-00000-2dd9ac28-2fcb-40f7-a71f-562860cdbf04-c000.snappy.parquet,1045,1658999608000
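
The listing above contains the two kinds of entries that make up the table: the `_delta_log/` directory, which holds the metadata, and the Parquet part files, which hold the data. A small sketch of how such a listing could be separated is shown below. `split_delta_listing` is a hypothetical helper, and it assumes data files end in `.parquet`, as they do in this output.

```python
from typing import List, Tuple


def split_delta_listing(names: List[str]) -> Tuple[List[str], List[str]]:
    """Split entry names into (transaction log directories, Parquet data files)."""
    log_dirs = [n for n in names if n.rstrip("/") == "_delta_log"]
    data_files = [n for n in names if n.endswith(".parquet")]
    return log_dirs, data_files


# Entry names copied from the listing above.
names = [
    "_delta_log/",
    "part-00000-2dd9ac28-2fcb-40f7-a71f-562860cdbf04-c000.snappy.parquet",
]
log_dirs, data_files = split_delta_listing(names)
print(log_dirs)
print(data_files)
```

In the notebook, the names could be taken from the listing with `[f.name for f in dbutils.fs.ls(tbl_location)]`.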


Drop the table.

In [0]:
%sql
DROP TABLE managed_table_in_db_with_default_location;

#### Note the table's directory and its log and data files are deleted. Only the database directory remains.

In [0]:
%python 

db_location = f"{hive_root}/{db_name}"
print(db_location)
dbutils.fs.ls(db_location)

### We now create a table in the database with the custom location and insert data.

Note that the schema must be provided because there is no data from which to infer the schema.

In [0]:
%sql
USE ${da.db_name}_custom_location;

CREATE OR REPLACE TABLE managed_table_in_db_with_custom_location (width INT, length INT, height INT);
INSERT INTO managed_table_in_db_with_custom_location VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_custom_location;

width,length,height
3,2,1


Again, we'll look at the description to find the table location.

In [0]:
%sql
DESCRIBE EXTENDED managed_table_in_db_with_custom_location;

col_name,data_type,comment
width,int,
length,int,
height,int,
,,
# Partitioning,,
Not partitioned,,
,,
# Detailed Table Information,,
Catalog,spark_catalog,
Database,dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_1_custom_location,


As expected, this managed table is created in the path specified with the **`LOCATION`** keyword during database creation. As such, the data and metadata for the table are persisted in a directory under that path.

In [0]:
%python 

table_name = "managed_table_in_db_with_custom_location"
tbl_location = f"{DA.paths.working_dir}/_custom_location.db/{table_name}"
print(tbl_location)

files = dbutils.fs.ls(tbl_location)
display(files)

path,name,size,modificationTime
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.1/_custom_location.db/managed_table_in_db_with_custom_location/_delta_log/,_delta_log/,0,1658999614000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.1/_custom_location.db/managed_table_in_db_with_custom_location/part-00000-76a5a932-528f-4fb6-94cd-58ae7e62e22f-c000.snappy.parquet,part-00000-76a5a932-528f-4fb6-94cd-58ae7e62e22f-c000.snappy.parquet,1045,1658999613000


Let's drop the table.

In [0]:
%sql
DROP TABLE managed_table_in_db_with_custom_location;

#### Note that the table's directory and its log and data files are deleted. Only the database directory remains.

In [0]:
%python 

db_location = f"{DA.paths.working_dir}/_custom_location.db"
print(db_location)

dbutils.fs.ls(db_location)


# Tables
We will create an external (unmanaged) table from sample data. 

The sample data are in CSV format. We want to create a Delta table whose **`LOCATION`** points to a directory of our choice.

In [0]:
%sql
USE ${da.db_name}_default_location;

CREATE OR REPLACE TEMPORARY VIEW temp_delays USING CSV OPTIONS (
  path = '${da.paths.working_dir}/flights/departuredelays.csv',
  header = "true",
  mode = "FAILFAST" -- abort file parsing with a RuntimeException if any malformed lines are encountered
);
CREATE OR REPLACE TABLE external_table LOCATION '${da.paths.working_dir}/external_table' AS
  SELECT * FROM temp_delays;

SELECT * FROM external_table;

date,delay,distance,origin,destination
1011245,6,602,ABE,ATL
1020600,-8,369,ABE,DTW
1021245,-2,602,ABE,ATL
1020605,-4,602,ABE,ATL
1031245,-4,602,ABE,ATL
1030605,0,602,ABE,ATL
1041243,10,602,ABE,ATL
1040605,28,602,ABE,ATL
1051245,88,602,ABE,ATL
1050605,9,602,ABE,ATL


Let's note the location of the table's data in this lesson's working directory.

In [0]:
%sql
DESCRIBE TABLE EXTENDED external_table;

col_name,data_type,comment
date,string,
delay,string,
distance,string,
origin,string,
destination,string,
,,
# Partitioning,,
Not partitioned,,
,,
# Detailed Table Information,,


Now, we drop the table.

In [0]:
%sql
DROP TABLE external_table;

The table definition no longer exists in the metastore, but the underlying data remain intact.

In [0]:
%python 
tbl_path = f"{DA.paths.working_dir}/external_table"
files = dbutils.fs.ls(tbl_path)
display(files)

path,name,size,modificationTime
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.1/external_table/_delta_log/,_delta_log/,0,1658999754000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.1/external_table/part-00000-5bb67393-f167-4c00-9e01-ea5ce21ac9aa-c000.snappy.parquet,part-00000-5bb67393-f167-4c00-9e01-ea5ce21ac9aa-c000.snappy.parquet,1814492,1658999752000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.1/external_table/part-00001-4b7407a1-0fee-4906-a270-72e770df9f17-c000.snappy.parquet,part-00001-4b7407a1-0fee-4906-a270-72e770df9f17-c000.snappy.parquet,1992029,1658999752000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.1/external_table/part-00002-40efc8bd-80df-45c9-96b6-a9679f5558ca-c000.snappy.parquet,part-00002-40efc8bd-80df-45c9-96b6-a9679f5558ca-c000.snappy.parquet,1949998,1658999752000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.1/external_table/part-00003-4516d586-0280-4541-b36e-a86d472f1fe6-c000.snappy.parquet,part-00003-4516d586-0280-4541-b36e-a86d472f1fe6-c000.snappy.parquet,983852,1658999752000
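
The difference between dropping a managed table and dropping an external table can be summarized with a toy model. This is a deliberately simplified sketch, not how Spark or the metastore is implemented: `ToyCatalog` and its methods are hypothetical, and the paths are made up.

```python
class ToyCatalog:
    """Toy model: a metastore (table definitions) plus storage (data directories)."""

    def __init__(self):
        self.tables = {}      # table name -> (path, is_managed)
        self.storage = set()  # paths that currently hold data

    def create_table(self, name, path, managed):
        self.tables[name] = (path, managed)
        self.storage.add(path)

    def drop_table(self, name):
        # The definition is always removed from the metastore.
        path, managed = self.tables.pop(name)
        if managed:
            # Managed table: the data directory is deleted as well.
            self.storage.discard(path)
        # External table: the data directory is left untouched.


catalog = ToyCatalog()
catalog.create_table("managed_table", "dbfs:/warehouse/demo.db/managed_table", managed=True)
catalog.create_table("external_table", "dbfs:/work/external_table", managed=False)

catalog.drop_table("managed_table")
catalog.drop_table("external_table")

print(catalog.tables)   # {}
print(catalog.storage)  # {'dbfs:/work/external_table'}
```

Both definitions are gone after the drops, but only the external table's directory survives, which matches the listing above.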


## Clean up
Drop both databases.

In [0]:
%sql
DROP DATABASE ${da.db_name}_default_location CASCADE;
DROP DATABASE ${da.db_name}_custom_location CASCADE;

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python 
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>