<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Databases, Tables, and Views Lab

## Learning Objectives
By the end of this lab, you should be able to:
- **Create and explore interactions between various relational entities**, including:
  - Databases
  - Tables (managed and external)
  - Views (views, temp views, and global temp views)

**Resources**
* <a href="https://docs.databricks.com/user-guide/tables.html" target="_blank">Databases and Tables - Databricks Docs</a>
* <a href="https://docs.databricks.com/user-guide/tables.html#managed-and-unmanaged-tables" target="_blank">Managed and Unmanaged Tables</a>
* <a href="https://docs.databricks.com/user-guide/tables.html#create-a-table-using-the-ui" target="_blank">Creating a Table with the UI</a>
* <a href="https://docs.databricks.com/user-guide/tables.html#create-a-local-table" target="_blank">Create a Local Table</a>
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables" target="_blank">Saving to Persistent Tables</a>

### Getting Started

Run the following cell to configure variables and datasets for this lesson.

In [0]:
%run ../Includes/Classroom-Setup-3.3L

## Overview of the Data

The data include multiple entries from a selection of weather stations, including average temperatures recorded in either Fahrenheit or Celsius. The table has the following schema:

|ColumnName  | DataType| Description|
|------------|---------|------------|
|NAME        |string   | Station name |
|STATION     |string   | Unique ID |
|LATITUDE    |float    | Latitude |
|LONGITUDE   |float    | Longitude |
|ELEVATION   |float    | Elevation |
|DATE        |date     | YYYY-MM-DD |
|UNIT        |string   | Temperature units |
|TAVG        |float    | Average temperature |

This data is stored in the Parquet format; preview the data with the query below.

In [0]:
%sql
SELECT * 
FROM parquet.`${da.paths.working_dir}/weather`

NAME,STATION,LATITUDE,LONGITUDE,ELEVATION,DATE,UNIT,TAVG
"HAYWARD AIR TERMINAL, CA US",USW00093228,37.6542,-122.115,13.1,2018-05-27,F,61.0
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-01-05,C,11.7
"SAN FRANCISCO INTERNATIONAL AIRPORT, CA US",USW00023234,37.6197,-122.3647,2.4,2018-02-24,C,8.3
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-03-26,C,9.4
"HOUSTON INTERCONTINENTAL AIRPORT, TX US",USW00012960,29.98,-95.36,29.0,2018-05-25,F,80.0
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-05-16,C,11.1
"BLACK DIAMOND CALIFORNIA, CA US",USR0000CBKD,37.95,-121.8844,487.7,2018-05-25,C,10.6
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-05-21,C,11.7
"WOODACRE CALIFORNIA, CA US",USR0000CWOO,37.9906,-122.6447,426.7,2018-05-26,F,53.0
"BRIONES CALIFORNIA, CA US",USR0000CBRI,37.9442,-122.1178,442.0,2018-04-08,F,53.0


## Create a Database

Create a database in the default location using the **`da.db_name`** variable defined in the setup script.

In [0]:
%sql
-- TODO
CREATE DATABASE IF NOT EXISTS ${da.db_name}

Run the cell below to check your work.

In [0]:
%python 
assert spark.sql(f"SHOW DATABASES").filter(f"databaseName == '{DA.db_name}'").count() == 1, "Database not present"

## Change to Your New Database

**`USE`** your newly created database.

In [0]:
%sql
-- TODO

USE ${da.db_name}

Run the cell below to check your work.

In [0]:
%python
assert spark.sql(f"SHOW CURRENT DATABASE").first()["namespace"] == DA.db_name, "Not using the correct database"

## Create a Managed Table
Use a CTAS statement to create a managed table named **`weather_managed`**.

In [0]:
%sql
-- TODO

CREATE TABLE weather_managed AS
SELECT * 
FROM parquet.`${da.paths.working_dir}/weather`

num_affected_rows,num_inserted_rows


In [0]:
%sql
SELECT * FROM weather_managed

NAME,STATION,LATITUDE,LONGITUDE,ELEVATION,DATE,UNIT,TAVG
"HAYWARD AIR TERMINAL, CA US",USW00093228,37.6542,-122.115,13.1,2018-05-27,F,61.0
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-01-05,C,11.7
"SAN FRANCISCO INTERNATIONAL AIRPORT, CA US",USW00023234,37.6197,-122.3647,2.4,2018-02-24,C,8.3
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-03-26,C,9.4
"HOUSTON INTERCONTINENTAL AIRPORT, TX US",USW00012960,29.98,-95.36,29.0,2018-05-25,F,80.0
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-05-16,C,11.1
"BLACK DIAMOND CALIFORNIA, CA US",USR0000CBKD,37.95,-121.8844,487.7,2018-05-25,C,10.6
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-05-21,C,11.7
"WOODACRE CALIFORNIA, CA US",USR0000CWOO,37.9906,-122.6447,426.7,2018-05-26,F,53.0
"BRIONES CALIFORNIA, CA US",USR0000CBRI,37.9442,-122.1178,442.0,2018-04-08,F,53.0


Run the cell below to check your work.

In [0]:
%python
assert spark.table("weather_managed"), "Table named `weather_managed` does not exist"
assert spark.table("weather_managed").count() == 2559, "Incorrect row count"

## Create an External Table

Recall that an external table differs from a managed table in that you specify a **`LOCATION`** for its data when you create it. Create an external table called **`weather_external`** below.

In [0]:
%sql
-- TODO

CREATE TABLE weather_external
LOCATION "${da.paths.working_dir}/lab/external"
AS SELECT * 
FROM parquet.`${da.paths.working_dir}/weather`

num_affected_rows,num_inserted_rows


Run the cell below to check your work.

In [0]:
%python
assert spark.table("weather_external"), "Table named `weather_external` does not exist"
assert spark.table("weather_external").count() == 2559, "Incorrect row count"
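
Comparing the two CTAS statements used so far, the only difference is the **`LOCATION`** clause. The sketch below makes that explicit with a hypothetical helper, `build_ctas` (not part of the lab or of any Spark API), and placeholder paths. It only builds SQL strings and does not execute them.

```python
from typing import Optional


def build_ctas(table: str, source_path: str, location: Optional[str] = None) -> str:
    """Build a CTAS statement; adding a location makes the table external."""
    # The LOCATION clause is the only part that differs between the two forms
    location_clause = f'\nLOCATION "{location}"' if location else ""
    return (
        f"CREATE TABLE {table}{location_clause}\n"
        f"AS SELECT *\n"
        f"FROM parquet.`{source_path}`"
    )


# Placeholder paths, assumed for illustration only
managed_sql = build_ctas("weather_managed", "/tmp/example/weather")
external_sql = build_ctas(
    "weather_external", "/tmp/example/weather", location="/tmp/example/lab/external"
)

print(managed_sql)
print(external_sql)
```

In a notebook, either string could then be passed to `spark.sql(...)`.
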

## Examine Table Details
Use the SQL command **`DESCRIBE EXTENDED table_name`** to examine the two weather tables.

In [0]:
%sql
DESCRIBE EXTENDED weather_managed

col_name,data_type,comment
NAME,string,
STATION,string,
LATITUDE,float,
LONGITUDE,float,
ELEVATION,float,
DATE,date,
UNIT,string,
TAVG,float,
,,
# Partitioning,,


In [0]:
%sql
DESCRIBE EXTENDED weather_external

col_name,data_type,comment
NAME,string,
STATION,string,
LATITUDE,float,
LONGITUDE,float,
ELEVATION,float,
DATE,date,
UNIT,string,
TAVG,float,
,,
# Partitioning,,


Run the following helper code to extract and compare the table locations.

In [0]:
%python
def getTableLocation(tableName):
    return spark.sql(f"DESCRIBE DETAIL {tableName}").select("location").first()[0]

In [0]:
%python
managedTablePath = getTableLocation("weather_managed")
externalTablePath = getTableLocation("weather_external")

print(f"""The weather_managed table is saved at: 

    {managedTablePath}

The weather_external table is saved at:

    {externalTablePath}""")

List the contents of these directories to confirm that data exists in both locations.

In [0]:
%python
files = dbutils.fs.ls(managedTablePath)
display(files)

path,name,size,modificationTime
dbfs:/user/hive/warehouse/dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_3l.db/weather_managed/_delta_log/,_delta_log/,0,1659002813000
dbfs:/user/hive/warehouse/dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_3l.db/weather_managed/part-00000-a9f56824-6d82-451c-b352-7f43130f8dec-c000.snappy.parquet,part-00000-a9f56824-6d82-451c-b352-7f43130f8dec-c000.snappy.parquet,7642,1659002806000
dbfs:/user/hive/warehouse/dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_3l.db/weather_managed/part-00001-82ce3228-d80d-4c35-a64b-1743ad285910-c000.snappy.parquet,part-00001-82ce3228-d80d-4c35-a64b-1743ad285910-c000.snappy.parquet,7592,1659002806000
dbfs:/user/hive/warehouse/dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_3l.db/weather_managed/part-00002-222e6edf-e93e-4c4b-9ddf-87ad050dc18f-c000.snappy.parquet,part-00002-222e6edf-e93e-4c4b-9ddf-87ad050dc18f-c000.snappy.parquet,7578,1659002806000
dbfs:/user/hive/warehouse/dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_3l.db/weather_managed/part-00003-8ede16c5-8544-4e6b-b5ce-8e40c24a5f6b-c000.snappy.parquet,part-00003-8ede16c5-8544-4e6b-b5ce-8e40c24a5f6b-c000.snappy.parquet,7583,1659002806000


In [0]:
%python
files = dbutils.fs.ls(externalTablePath)
display(files)

path,name,size,modificationTime
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/_delta_log/,_delta_log/,0,1659002923000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/part-00000-6899f663-e394-4a7c-9811-6f63a865d7cb-c000.snappy.parquet,part-00000-6899f663-e394-4a7c-9811-6f63a865d7cb-c000.snappy.parquet,7642,1659002921000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/part-00001-8813308a-36ec-4ab6-b233-2bcfece168de-c000.snappy.parquet,part-00001-8813308a-36ec-4ab6-b233-2bcfece168de-c000.snappy.parquet,7592,1659002921000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/part-00002-d14fa9a1-a1a6-4404-8357-21c213a41b22-c000.snappy.parquet,part-00002-d14fa9a1-a1a6-4404-8357-21c213a41b22-c000.snappy.parquet,7578,1659002921000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/part-00003-13bf2eaa-8fbd-4c1c-9b68-c80263a8026a-c000.snappy.parquet,part-00003-13bf2eaa-8fbd-4c1c-9b68-c80263a8026a-c000.snappy.parquet,7583,1659002921000


### Check Directory Contents after Dropping Database and All Tables
Drop the database together with all of its tables. The **`CASCADE`** keyword will accomplish this.

In [0]:
%sql
-- TODO

DROP DATABASE ${da.db_name} CASCADE

Run the cell below to check your work.

In [0]:
%python
assert spark.sql(f"SHOW DATABASES").filter(f"databaseName == '{DA.db_name}'").count() == 0, "Database present"

With the database dropped, the files for the managed table will have been deleted as well.

Run the following cell, which will throw a **`FileNotFoundException`** as your confirmation.

In [0]:
%python
files = dbutils.fs.ls(managedTablePath)
display(files)

In [0]:
%python
files = dbutils.fs.ls(externalTablePath)
display(files)

path,name,size,modificationTime
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/_delta_log/,_delta_log/,0,1659002923000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/part-00000-6899f663-e394-4a7c-9811-6f63a865d7cb-c000.snappy.parquet,part-00000-6899f663-e394-4a7c-9811-6f63a865d7cb-c000.snappy.parquet,7642,1659002921000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/part-00001-8813308a-36ec-4ab6-b233-2bcfece168de-c000.snappy.parquet,part-00001-8813308a-36ec-4ab6-b233-2bcfece168de-c000.snappy.parquet,7592,1659002921000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/part-00002-d14fa9a1-a1a6-4404-8357-21c213a41b22-c000.snappy.parquet,part-00002-d14fa9a1-a1a6-4404-8357-21c213a41b22-c000.snappy.parquet,7578,1659002921000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/external/part-00003-13bf2eaa-8fbd-4c1c-9b68-c80263a8026a-c000.snappy.parquet,part-00003-13bf2eaa-8fbd-4c1c-9b68-c80263a8026a-c000.snappy.parquet,7583,1659002921000


In [0]:
%python
files = dbutils.fs.ls(DA.paths.working_dir)
display(files)

path,name,size,modificationTime
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/flight_delays,flight_delays,33396236,1659002498000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/lab/,lab/,0,1659002920000
dbfs:/user/manujkumar.joshi@celebaltech.com/dbacademy/dewd/3.3l/weather/,weather/,0,1659002505000


**This highlights the main differences between managed and external tables.** By default, the files associated with managed tables are stored under the database directory on the root DBFS storage linked to the workspace, and are deleted when the table is dropped.

Files for external tables are persisted in the location provided at table creation, which prevents users from inadvertently deleting the underlying files. **External tables can easily be migrated to other databases or renamed, but these operations with managed tables will require rewriting ALL underlying files.**
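
The drop behavior described here can be mimicked with a toy model. The sketch below is plain Python on a temporary local directory. The `ToyCatalog` class is hypothetical and is not how Spark or the metastore is implemented; it only illustrates that dropping a managed table removes its files while dropping an external table leaves them in place.

```python
import shutil
import tempfile
from pathlib import Path


class ToyCatalog:
    """Toy model only: mimics drop semantics, not Spark's metastore."""

    def __init__(self, warehouse: Path):
        self.warehouse = warehouse
        self.tables = {}  # table name -> (data path, is_managed)

    def create_table(self, name, location=None):
        # No explicit location means the catalog owns the files (managed)
        is_managed = location is None
        path = self.warehouse / name if is_managed else Path(location)
        path.mkdir(parents=True, exist_ok=True)
        (path / "part-00000.parquet").write_text("data")
        self.tables[name] = (path, is_managed)
        return path

    def drop_table(self, name):
        path, is_managed = self.tables.pop(name)
        if is_managed:
            # Managed: metadata and files are removed together
            shutil.rmtree(path)
        # External: only the catalog entry is removed; files stay on disk


root = Path(tempfile.mkdtemp())
catalog = ToyCatalog(root / "warehouse")
managed_path = catalog.create_table("weather_managed")
external_path = catalog.create_table(
    "weather_external", location=root / "lab" / "external"
)

catalog.drop_table("weather_managed")
catalog.drop_table("weather_external")

managed_exists = managed_path.exists()
external_exists = external_path.exists()
print(managed_exists, external_exists)  # False True

shutil.rmtree(root)  # remove the temporary directory used by this sketch
```
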

## Create a Database with a Specified Path

Assuming you dropped your database in the last step, you can reuse the same database name.

In [0]:
%sql
CREATE DATABASE ${da.db_name} LOCATION '${da.paths.working_dir}/${da.db_name}';
USE ${da.db_name};

Recreate your **`weather_managed`** table in this new database and print out the location of this table.

In [0]:
%sql
-- TODO
CREATE TABLE weather_managed AS
SELECT * 
FROM parquet.`${da.paths.working_dir}/weather`

num_affected_rows,num_inserted_rows


In [0]:
%sql
SELECT * FROM weather_managed

NAME,STATION,LATITUDE,LONGITUDE,ELEVATION,DATE,UNIT,TAVG
"HAYWARD AIR TERMINAL, CA US",USW00093228,37.6542,-122.115,13.1,2018-05-27,F,61.0
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-01-05,C,11.7
"SAN FRANCISCO INTERNATIONAL AIRPORT, CA US",USW00023234,37.6197,-122.3647,2.4,2018-02-24,C,8.3
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-03-26,C,9.4
"HOUSTON INTERCONTINENTAL AIRPORT, TX US",USW00012960,29.98,-95.36,29.0,2018-05-25,F,80.0
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-05-16,C,11.1
"BLACK DIAMOND CALIFORNIA, CA US",USR0000CBKD,37.95,-121.8844,487.7,2018-05-25,C,10.6
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-05-21,C,11.7
"WOODACRE CALIFORNIA, CA US",USR0000CWOO,37.9906,-122.6447,426.7,2018-05-26,F,53.0
"BRIONES CALIFORNIA, CA US",USR0000CBRI,37.9442,-122.1178,442.0,2018-04-08,F,53.0


In [0]:
%python
getTableLocation("weather_managed")

Run the cell below to check your work.

In [0]:
%python
assert spark.table("weather_managed"), "Table named `weather_managed` does not exist"
assert spark.table("weather_managed").count() == 2559, "Incorrect row count"

While here we're using the **`working_dir`** directory created on the DBFS root, _any_ object store can be used as the database directory. **Defining database directories for groups of users can greatly reduce the chances of accidental data exfiltration**.
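
In the directory listings earlier in this lab, the managed table's files sat in a folder named after the table, directly under the database directory. Assuming the same layout holds for a database created with an explicit **`LOCATION`**, the expected table path can be sketched as below. The helper `expected_managed_path` and the example paths are hypothetical; compare the result against what `getTableLocation` actually returns rather than relying on it.

```python
import posixpath


def expected_managed_path(db_location: str, table_name: str) -> str:
    """Guess a managed table's directory: <database location>/<table name>."""
    # Strip a trailing slash so the join does not produce a double separator
    return posixpath.join(db_location.rstrip("/"), table_name.lower())


# Placeholder database locations, assumed for illustration only
default_db = "dbfs:/user/hive/warehouse/example.db/"
custom_db = "dbfs:/user/someone/working_dir/example_db"

print(expected_managed_path(default_db, "weather_managed"))
print(expected_managed_path(custom_db, "weather_managed"))
```
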

## Views and their Scoping

Using the provided **`AS`** clause, register:
- a view named **`celsius`**
- a temporary view named **`celsius_temp`**
- a global temp view named **`celsius_global`**

In [0]:
%sql
-- TODO

CREATE OR REPLACE VIEW celsius
AS (SELECT *
  FROM weather_managed
  WHERE UNIT = "C")

Run the cell below to check your work.

In [0]:
%python
assert spark.table("celsius"), "Table named `celsius` does not exist"
assert spark.sql(f"SHOW TABLES").filter(f"tableName == 'celsius'").first()["isTemporary"] == False, "Table is temporary"

Now create a temporary view.

In [0]:
%sql
-- TODO

CREATE OR REPLACE TEMPORARY VIEW celsius_temp
AS (SELECT *
  FROM weather_managed
  WHERE UNIT = "C")

Run the cell below to check your work.

In [0]:
%python
assert spark.table("celsius_temp"), "Table named `celsius_temp` does not exist"
assert spark.sql(f"SHOW TABLES").filter(f"tableName == 'celsius_temp'").first()["isTemporary"] == True, "Table is not temporary"

Now register a global temp view.

In [0]:
%sql
-- TODO

CREATE OR REPLACE GLOBAL TEMPORARY VIEW celsius_global
AS (SELECT *
  FROM weather_managed
  WHERE UNIT = "C")

Run the cell below to check your work.

In [0]:
%python
assert spark.table("global_temp.celsius_global"), "Global temporary view named `celsius_global` does not exist"

Views will be displayed alongside tables when listing from the catalog.

In [0]:
%sql
SHOW TABLES

database,tableName,isTemporary
dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_3l,celsius,False
dbacademy_manujkumar_joshi_celebaltech_com_dewd_3_3l,weather_managed,False
,celsius_temp,True


Note the following:
- The view is associated with the current database. This view will be available to any user that can access this database and will persist between sessions.
- The temp view is not associated with any database. The temp view is ephemeral and is only accessible in the current SparkSession.
- The global temp view does not appear in our catalog. **Global temp views will always register to the `global_temp` database.** The **`global_temp`** database is ephemeral and tied to the lifetime of the cluster, and it is only accessible by notebooks attached to the same cluster on which it was created.
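
The first two notes can be read directly off the **`SHOW TABLES`** output. As a minimal sketch, the plain-Python snippet below classifies rows shaped like that output. The `TableRow` type, the `describe_scope` helper, and the database name `example_db` are hypothetical stand-ins, not part of the lab.

```python
from typing import NamedTuple


class TableRow(NamedTuple):
    """One row of SHOW TABLES output: database, tableName, isTemporary."""
    database: str
    tableName: str
    isTemporary: bool


def describe_scope(row: TableRow) -> str:
    # Temp views carry no database and live only in the current SparkSession
    if row.isTemporary:
        return "temp view, current session only"
    return f"persisted in database {row.database}"


# Rows mirroring the output above, with a placeholder database name
rows = [
    TableRow("example_db", "celsius", False),
    TableRow("example_db", "weather_managed", False),
    TableRow("", "celsius_temp", True),
]

scopes = {row.tableName: describe_scope(row) for row in rows}
for name, scope in scopes.items():
    print(f"{name}: {scope}")
```

Note that `celsius_global` has no row to classify here, which matches the third note: it is registered in `global_temp`, not in the current database.
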

In [0]:
%sql
SELECT * FROM global_temp.celsius_global

NAME,STATION,LATITUDE,LONGITUDE,ELEVATION,DATE,UNIT,TAVG
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-01-05,C,11.7
"SAN FRANCISCO INTERNATIONAL AIRPORT, CA US",USW00023234,37.6197,-122.3647,2.4,2018-02-24,C,8.3
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-03-26,C,9.4
"BIG ROCK CALIFORNIA, CA US",USR0000CBIR,38.0394,-122.57,457.2,2018-05-16,C,11.1
"BLACK DIAMOND CALIFORNIA, CA US",USR0000CBKD,37.95,-121.8844,487.7,2018-05-25,C,10.6
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-05-21,C,11.7
"SAN FRANCISCO INTERNATIONAL AIRPORT, CA US",USW00023234,37.6197,-122.3647,2.4,2018-03-26,C,11.7
"LAS TRAMPAS CALIFORNIA, CA US",USR0000CTRA,37.8339,-122.0669,536.4,2018-01-30,C,15.0
"BLACK DIAMOND CALIFORNIA, CA US",USR0000CBKD,37.95,-121.8844,487.7,2018-04-30,C,10.6
"HOUSTON WILLIAM P HOBBY AIRPORT, TX US",USW00012918,29.63806,-95.28194,13.4,2018-04-03,C,23.9


While no job was triggered when defining these views, a job is triggered _each time_ a query is executed against the view.
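
One way to picture this: a view stores a query, not its results. The toy class below is plain Python, and `ToyView` is a hypothetical name with no relation to Spark internals. It re-runs its stored query on every read, so defining it does no work and each read reflects the current state of the underlying data.

```python
class ToyView:
    """Toy model only: stores a query and re-evaluates it on every read."""

    def __init__(self, query):
        self._query = query  # nothing is executed at definition time
        self.executions = 0

    def collect(self):
        self.executions += 1
        return self._query()


# Sample rows from the weather data: (NAME, UNIT, TAVG)
weather = [
    ("BIG ROCK CALIFORNIA, CA US", "C", 11.7),
    ("HAYWARD AIR TERMINAL, CA US", "F", 61.0),
]

celsius = ToyView(lambda: [row for row in weather if row[1] == "C"])
executions_after_definition = celsius.executions  # still 0: defining ran nothing

first = celsius.collect()

# New data in the underlying table is visible on the next read
weather.append(("LAS TRAMPAS CALIFORNIA, CA US", "C", 9.4))
second = celsius.collect()

print(executions_after_definition, len(first), len(second), celsius.executions)
```
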

## Clean Up
Drop the database and all tables to clean up your workspace.

In [0]:
%sql
DROP DATABASE ${da.db_name} CASCADE

## Synopsis

In this lab we:
- Created and deleted databases
- Explored behavior of managed and external tables
- Learned about the scoping of views

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>