# Exploring catalogs

Sierra Publishing hired you to work as a data engineer in their centralized IT department. You receive various daily requests from different parts of your organization regarding different datasets. Historically, you have explored data programmatically by reading different datasets one at a time.

<img src='img/Data Explorer.png'>

**Instructions**

What is one of the primary benefits of using the Catalog Explorer instead of your previous approach?

- Provides the code to query a specific table. ❌
- Ability to see sample data for a particular table. ✅
- See the lineage for your data assets. ❌
- Ability to automatically clean your data. ❌

# Adding your datasets

In this exercise, you will integrate a new dataset into Databricks using a CSV file.

The data governance officer for Sierra Publishing has the important job of organizing and maintaining different datasets for the organization. In this scenario, you have requested that the data governance officer review some historical publication data that will help you with one of your projects. Help your colleague integrate and organize the new dataset into your Unity Catalog implementation.

**Instructions**

1. Navigate to the Data Ingestion section using the menu on the left-hand side of the UI. In the Data Ingestion window, click on Create or modify table.

2. Click browse in order to open a new pop-up window. Navigate to the Desktop folder, then click on Datasets. Select the CSV file called bx_books_file.csv and then click Open.

3. After uploading, you will see a preview of your data, as well as options to select your catalog and schema of choice. Make sure that the following settings are selected at the top of the page.
    - Catalog: Starts with databricks_ws_xxxx
    - Schema: default

    After checking the settings are as above, click Create table in the bottom right of the window.

4. If you are not redirected automatically, click the Catalog button on the left-hand side of the Databricks UI and find the catalog and schema you used. Click on the new table name bx_books_file and review the Overview details of the table.

5. Looking at the overview data for the `bx_books_file` dataset, what is the data type for the "Year-Of-Publication" column?

    - string ❌
    - date ❌
    - bigint ✅
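
The column types shown in the Overview tab can also be checked from a notebook. The following is a minimal sketch, not part of the exercise: the helper names `qualified_name` and `describe_statement` are hypothetical, and `databricks_ws_xxxx` is the placeholder catalog name from the steps above, which you would replace with your own.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a backtick-quoted three-level name: catalog.schema.table."""
    parts = (catalog, schema, table)
    # Double any backtick inside a part so the quoting stays valid.
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


def describe_statement(catalog: str, schema: str, table: str) -> str:
    """Build a DESCRIBE TABLE statement for the given table."""
    return f"DESCRIBE TABLE {qualified_name(catalog, schema, table)}"


# "databricks_ws_xxxx" is a placeholder, not a real catalog name.
stmt = describe_statement("databricks_ws_xxxx", "default", "bx_books_file")
print(stmt)

# In a Databricks notebook, where `spark` is predefined, the statement
# could be run with:
# spark.sql(stmt).show(truncate=False)
```

Running the statement in a notebook should list each column with its data type, which you can compare against the Overview tab.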

# Setting Permissions

In this exercise, you will update the permission settings of your new data table.

You are not done working with the data governance officer! Now that you have uploaded the data to a catalog and schema, you will help the data governance officer make sure the data has the proper access and governance settings.

As a reminder, you uploaded the bx_books_file.csv data into the following catalog and schema:

- Catalog: Starts with databricks_ws_xxxx (will be similar to the username used to log in)
- Schema: default

*If you lost your progress, you will have to create the data table bx_books_file again by following the steps in the Adding your datasets exercise.*

**Instructions**

1. Navigate to your Catalog Explorer. Locate the catalog and schema where you wrote your dataset, and click on the schema.

2. After clicking on the default schema, you should see a Permissions tab. Click on it to see who currently has access to the schema.

3. Now click into the table you created, and check the permissions for bx_books_file. Notice how similar or dissimilar these are to the schema-level permissions.

4. In the Permissions tab of bx_books_file, click the Grant button to provide access to your data consumers. In the Principals search, type "users" and select the group All account users to provide all privileges to.

5. Which of the following statements best describes the relationship between schema-level and table-level permissions in Unity Catalog?
    - Permissions you set at the schema are forced down to the table. You cannot change permissions at the table level. ❌
    - Permissions from the schema will trickle down to the table, but you can also change them at the table level. ✅
    - Permissions are set separately at the schema and the table levels. There is no relationship. ❌
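
The grant made through the UI above has a SQL equivalent in Unity Catalog. The sketch below only builds the statements as strings. It assumes that the UI group "All account users" corresponds to the principal name `account users`; the helper names are hypothetical, and `databricks_ws_xxxx` is the placeholder catalog name from this exercise.

```python
def grant_statement(privilege: str, securable_type: str,
                    securable: str, principal: str) -> str:
    """Build a GRANT statement for a Unity Catalog securable."""
    return f"GRANT {privilege} ON {securable_type} {securable} TO `{principal}`"


def show_grants_statement(securable_type: str, securable: str) -> str:
    """Build a SHOW GRANTS statement to list current permissions."""
    return f"SHOW GRANTS ON {securable_type} {securable}"


# Placeholder names from the exercise; replace with your own catalog.
schema_name = "databricks_ws_xxxx.default"
table_name = schema_name + ".bx_books_file"

# Assumption: "All account users" in the UI is the `account users` principal.
table_grant = grant_statement("ALL PRIVILEGES", "TABLE", table_name, "account users")

statements = [
    show_grants_statement("SCHEMA", schema_name),  # schema-level permissions
    table_grant,                                   # table-level grant
    show_grants_statement("TABLE", table_name),    # table-level permissions
]
for statement in statements:
    print(statement)

# In a Databricks notebook, each statement could be run with spark.sql(statement).
```

Comparing the two `SHOW GRANTS` results is one way to see which permissions come from the schema and which were granted directly on the table.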

# Node capabilities: Single vs. Multi

As one of the Databricks Workspace Administrators at Sierra Publishing, you review requests for new clusters and create them for different user groups. You would like your users to become more self-sufficient and create their own clusters, so you want to write some guidelines that help them choose the right kind of cluster.

**Instructions**

For each of the following scenarios, select whether a single-node or a multi-node cluster would be the better, more efficient option. Each scenario fits into only one bucket.

- **Single-node:** Exploratory Data Analysis with pandas and seaborn, Transforming a dataset that is 30 GB in size.
- **Multi-node:** Using SparkML to train a complex AI model, Transforming a dataset that is 30 TB in size.
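
The reasoning behind this sorting can be written down as a toy rule of thumb. This is a sketch for illustration only: the function `recommend_cluster`, the flag `needs_distributed_compute`, and the 100 GB default cutoff are all assumptions, not Databricks guidance. The real cutoff depends on the memory of the chosen node type.

```python
def recommend_cluster(dataset_gb: float,
                      needs_distributed_compute: bool,
                      single_node_limit_gb: float = 100.0) -> str:
    """Suggest a cluster type from dataset size and workload style.

    single_node_limit_gb is an arbitrary illustrative cutoff.
    """
    if needs_distributed_compute or dataset_gb > single_node_limit_gb:
        return "multi-node"
    return "single-node"


# The scenarios from this exercise. The 1 GB size for the pandas
# analysis is a made-up value; the exercise does not state one.
scenarios = {
    "EDA with pandas and seaborn": recommend_cluster(1, False),
    "Transforming a 30 GB dataset": recommend_cluster(30, False),
    "Training a complex model with SparkML": recommend_cluster(30, True),
    "Transforming a 30 TB dataset": recommend_cluster(30_000, False),
}
for name, choice in scenarios.items():
    print(f"{name}: {choice}")
```

Libraries such as pandas and seaborn run on a single machine, so extra worker nodes would sit idle for that kind of work, while very large datasets and distributed training benefit from spreading the load across workers.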

# Configuring clusters

You have received various requests from your central IT group to rein in the kinds of clusters that can be created in Databricks.

The IT team will start implementing different cluster policies for the different groups using the Databricks platform. Since you lead your data engineering team, the IT team would like you to provide a list of configurations you need to complete your work.

**Instructions**

Which of the following is not a valid cluster configuration in Databricks?

- Cluster monthly budget ✅
- Databricks Runtime ❌
- Auto-termination time ❌
- Node instance types ❌

# Create your first cluster

In this exercise, you will create your first cluster.

You have continued investigating the Databricks UI and how it is being implemented at Sierra Publishing. The CTO wants you to keep going and suggests that now would be an excellent opportunity to create your own clusters, something you will need to do regularly when working with Databricks!

**Instructions**

1. Navigate to the Compute section of the platform.

2. Here you will see all compute resources that you have access to. In the top right of this window, click the Create with Personal Compute button.

3. Name the new cluster "first_cluster". This cluster will be used in another section to run some quick commands.

4. Create the cluster with the following configurations:
    - Databricks runtime: 14.3 LTS (not the GPU one)
    - Node type: Standard_F4s
    - Terminate after: 10 minutes

5. Create the cluster using the appropriate button at the bottom of your screen.

6. How many DBUs per hour will this cluster cost? `0.5`
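
The same settings can be expressed as a cluster specification of the kind the Databricks Clusters API accepts. The sketch below only builds and prints the JSON. It assumes that the runtime key for 14.3 LTS is `14.3.x-scala2.12` and that the cluster is a single-node cluster (no workers), which is why the single-node `spark_conf` and `custom_tags` entries are included. Verify both assumptions against your workspace before using it.

```python
import json

# Settings taken from the steps above; the runtime key and the
# single-node entries are assumptions to check in your workspace.
cluster_spec = {
    "cluster_name": "first_cluster",
    "spark_version": "14.3.x-scala2.12",  # assumed key for 14.3 LTS (non-GPU)
    "node_type_id": "Standard_F4s",
    "autotermination_minutes": 10,
    "num_workers": 0,  # single node: the driver does all the work
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

payload = json.dumps(cluster_spec, indent=2)
print(payload)
```

Keeping the specification in code makes it easier to review and to recreate the same cluster later, instead of repeating the clicks in the UI.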