
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# 2.3 - Project Setup Exploration

Explore the isolated catalogs within the project.

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

  - In the drop-down, select **More**.

  - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

1. Find the triangle icon to the right of your compute cluster name and click it.

1. Wait a few minutes for the cluster to start.

1. Once the cluster is running, complete the steps above to select your cluster.

## A. Classroom Setup



**NOTE:** The `DA` object is used only in Databricks Academy courses and is not available outside of them. It dynamically references the information needed to run the course.

#### The notebook "2.1 - Modularizing PySpark Code - Required" sets up the catalogs for this course. If you have not run this notebook, the catalogs will not be available.

In [0]:
%run ../Includes/Classroom-Setup-2.3

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Set DA dynamic references to the dev, stage and prod catalogs.



- DEV catalog reference: `DA.catalog_dev`
- STAGE catalog reference: `DA.catalog_stage`
- PROD catalog reference: `DA.catalog_prod`


## B. Explore Your Environment


1. Let's quickly explore the folder structure and files for this course. We will explore each of these folders in depth as the course progresses:

    a. In the left navigation bar, make sure to select the folder icon.

    b. Navigate back to the main course folder **DevOps Essentials for Data Engineering**, two folders up from this notebook. It contains the **Course Notebooks**, **src**, and **tests** folders.

    c. Your main course folder includes the following:

    - **Course Notebooks** folder: This folder contains the notebooks and files to follow along with during class demonstrations and labs.

    - **src** folder: Contains the source production notebooks and Python files for the project.

    - **tests** folder: Contains the unit and integration tests for our project.
    
    - A variety of files that will be used in the course.

    d. Navigate back to **Course Notebooks** -> **M02 - CI**. We will dive into each of these folders and files throughout the course project.

2. Your environment has been configured with the following catalogs and files:

- Catalog: **unique_name_1_dev**
  - Schema: **default**
    - Volume: **health**
      - *dev_health.csv* : Small subset of prod data, anonymized *PII*, 7,500 rows

- Catalog: **unique_name_2_stage**
  - Schema: **default**
    - Volume: **health**
      - *stage_health.csv* : Subset of prod data, 35,000 rows

- Catalog: **unique_name_3_prod**
  - Schema: **default**
    - Volume: **health**
      - *2025-01-01_health.csv*
      - *2025-01-02_health.csv*
      - *2025-01-03_health.csv*
      - CSV files are added to this cloud storage location daily.

Manually view your catalogs for this course.

<br/>
#### Complete the following to explore your catalogs:

a. In the navigation bar on the left, select the catalog icon.

b. Confirm that your environment has created three new catalogs for you, each with the volume and files listed above:

- **unique_name_1_dev**

- **unique_name_2_stage**

- **unique_name_3_prod**


**NOTE:** If you do not have the catalogs listed above, run the **M02 - CI/2.1 - Modularizing PySpark Code - REQUIRED** notebook.
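
The confirmation in step b can also be expressed as a small check. The sketch below is a hypothetical helper, not part of the course's `DA` object: it takes the catalog names you expect and the names visible in your workspace, and reports any that are missing. How you gather the visible names (for example, by reading them from the catalog view or from a `SHOW CATALOGS` query) is an assumption left to you; the sample values reuse the placeholder names from this notebook.

```python
def missing_catalogs(expected, available):
    """Return the expected catalog names that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in expected if name not in available_set]


# Placeholder names from this notebook; real names are unique per lab user.
expected = ["unique_name_1_dev", "unique_name_2_stage", "unique_name_3_prod"]

# Hypothetical listing in which the prod catalog has not been created yet.
available = ["unique_name_1_dev", "unique_name_2_stage", "some_other_catalog"]

missing = missing_catalogs(expected, available)
if missing:
    print(f"Missing catalogs: {missing}. Run notebook 2.1 to create them.")
else:
    print("All course catalogs are available.")
```

An empty result means all three catalogs exist; otherwise, the names returned are the ones notebook 2.1 still needs to create.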

3. Throughout the course, the following Python variables will be used to dynamically reference your course catalogs:
    - `DA.catalog_dev`
    - `DA.catalog_stage`
    - `DA.catalog_prod`

    Run the code below and confirm the variables refer to your catalog names.


In [0]:
print(f'DA.catalog_dev: {DA.catalog_dev}')
print(f'DA.catalog_stage: {DA.catalog_stage}')
print(f'DA.catalog_prod: {DA.catalog_prod}')

DA.catalog_dev: labuser9989464_1744809149_1_dev
DA.catalog_stage: labuser9989464_1744809149_2_stage
DA.catalog_prod: labuser9989464_1744809149_3_prod
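
These catalog names combine with the layout listed in section B to form volume paths of the shape `/Volumes/<catalog>/<schema>/<volume>/<file>`. The following is a minimal sketch with hypothetical helpers (`volume_path` and `daily_prod_file` are illustration names, not part of `DA`). It assumes the `default` schema and `health` volume described above, and the `YYYY-MM-DD_health.csv` naming seen in the prod volume.

```python
from datetime import date


def volume_path(catalog, filename, schema="default", volume="health"):
    """Build a volume path for a file in the course layout."""
    return f"/Volumes/{catalog}/{schema}/{volume}/{filename}"


def daily_prod_file(day):
    """Return the prod file name for a given date (assumed naming pattern)."""
    return f"{day.isoformat()}_health.csv"


# Example catalog names copied from the sample output above; yours will differ.
catalog_dev = "labuser9989464_1744809149_1_dev"
catalog_prod = "labuser9989464_1744809149_3_prod"

print(volume_path(catalog_dev, "dev_health.csv"))
print(volume_path(catalog_prod, daily_prod_file(date(2025, 1, 2))))
```

Inside the course notebooks, you would pass `DA.catalog_dev` or `DA.catalog_prod` rather than a literal string, so the same code works for every lab user.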


## C. Explore the Dev Data

1. Complete the following to manually view the development CSV file in the volume within the dev catalog:

    a. Expand the catalog **unique_name_1_dev**.

    b. Expand the **default** schema. 

    c. Confirm that a volume named **health** was created. 

    d. Expand the **health** volume and confirm the *dev_health.csv* file is available.

2. Run the following cell to view the raw dev CSV file and see the number of rows. 

    Notice that the *dev_health.csv* file:
    - Contains comma-delimited columns, with the **PII** column completely masked for simplicity.

    - Contains 7,500 rows (7,501 lines minus 1 for the header row).

**NOTE:** The *dev_health.csv* file is a small sample of the production data that we will use for development and testing.

In [0]:
spark.sql(f'''
SELECT *
FROM text.`/Volumes/{DA.catalog_dev}/default/health/dev_health.csv`
''').display()

value
"ID,PII,date,HighCholest,HighBP,BMI,Age,Education,income"
"0,********,2025-01-01,0,1.0,26.0,4.0,6.0,94607"
"1,********,2025-01-01,1,1.0,26.0,12.0,6.0,78231"
"2,********,2025-01-01,0,0.0,26.0,null,6.0,99310"
"3,********,2025-01-01,2,1.0,28.0,null,6.0,99665"
"4,********,2025-01-01,0,0.0,29.0,8.0,5.0,81819"
"5,********,2025-01-01,0,0.0,18.0,null,4.0,65444"
"6,********,2025-01-01,1,0.0,26.0,13.0,5.0,49569"
"7,********,2025-01-01,0,0.0,31.0,null,4.0,18870"
"8,********,2025-01-01,0,0.0,32.0,3.0,6.0,78247"

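The row arithmetic in step 2 (total lines minus one header line) and the masked **PII** column can also be checked without Spark. The sketch below uses Python's standard `csv` module on a few lines copied from the output above. The `MASK` constant and the variable names are illustrative assumptions; on the full file, the same count should come to 7,500 data rows.

```python
import csv
import io

# First lines of dev_health.csv, copied from the output above.
sample = """ID,PII,date,HighCholest,HighBP,BMI,Age,Education,income
0,********,2025-01-01,0,1.0,26.0,4.0,6.0,94607
1,********,2025-01-01,1,1.0,26.0,12.0,6.0,78231
2,********,2025-01-01,0,0.0,26.0,null,6.0,99310
"""

MASK = "********"  # assumed mask value, taken from the sample rows

# DictReader treats the first line as the header, so it is not counted as data.
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

total_lines = len(sample.splitlines())
data_rows = len(rows)
all_masked = all(row["PII"] == MASK for row in rows)

print(f"lines: {total_lines}, data rows: {data_rows}, PII masked: {all_masked}")
```

Reading the file as text, as the cell above does, returns the header as an ordinary line. That is why the raw line count is one higher than the number of data rows.
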


&copy; 2025 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>