# Databricks Overview: Discovering and Exploring Data

In this notebook, you'll learn how to discover and explore data assets in Databricks using:
- **Catalog Explorer** - UI-based data discovery
- **SQL commands** - Programmatic data exploration
- **Entity Relationship Diagrams** - Understanding table relationships
- **Table Insights** - Viewing usage patterns

## Prerequisites
Before starting, run the setup notebook, which creates the IoT dataset with:
- Dimension tables (`dim_factories`, `dim_models`, `dim_devices`)
- Bronze tables (`sensor_bronze`, `inspection_bronze`)
- Silver tables (`anomaly_detected`, `inspection_silver`)
- Gold table (`inspection_gold`)

---

## Table of Contents
1. [Discovering Data with Catalog Explorer](#catalog-explorer)
2. [Exploring Database Objects](#database-objects)
3. [Exploring Files in Volumes](#volumes)
4. [Viewing Entity Relationships](#entity-relationships)
5. [Understanding Table Insights](#table-insights)

---

Reference Documentation:
- [Discover Data](https://docs.databricks.com/aws/en/discover/)
- [Catalog Explorer](https://docs.databricks.com/aws/en/catalog-explorer/)
- [Explore Storage and Files](https://docs.databricks.com/aws/en/discover/files)
- [Explore Database Objects](https://docs.databricks.com/aws/en/discover/database-objects)

In [0]:
# Set your catalog and schema
CATALOG = 'default'  # Change to your catalog name
SCHEMA = 'db_crash_course'  # Change to match your setup

print(f"Using: {CATALOG}.{SCHEMA}")
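
The cells in this notebook interpolate `CATALOG` and `SCHEMA` directly into SQL strings, which can fail if a name contains characters such as hyphens. A minimal sketch of one way to guard against that is to backtick-quote each name part. The helper names `quote_ident` and `fqn` are hypothetical and not part of any Databricks API.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote one identifier, doubling any embedded backticks."""
    if not name:
        raise ValueError("identifier must be a non-empty string")
    return "`" + name.replace("`", "``") + "`"


def fqn(*parts: str) -> str:
    """Join catalog, schema, and object names into one quoted, dotted name."""
    return ".".join(quote_ident(part) for part in parts)


# Example values mirroring the configuration cell above.
print(fqn("default", "db_crash_course", "sensor_bronze"))
```

A quoted name built this way can be dropped into any of the f-string queries in place of the plain `{CATALOG}.{SCHEMA}.table` pattern.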

## 1. Discovering Data with Catalog Explorer <a id="catalog-explorer"></a>

**Catalog Explorer** is a UI tool for exploring and managing data assets. You can access it by clicking **Catalog** in the sidebar.

### What You Can Do with Catalog Explorer:
- **Find data assets** - Browse catalogs, schemas, tables, and volumes
- **Preview data** - View sample data and schema details
- **Manage Unity Catalog** - Create objects, manage permissions, view ownership
- **AI-assisted discovery** - Use AI-generated comments and natural language queries

### Exercise: Open Catalog Explorer
1. Click the **Catalog** icon in the left sidebar
2. Navigate to your catalog (e.g., `default`)
3. Find your schema (e.g., `db_crash_course`)
4. Explore the tables created during setup

You should see:
- **Dimension Tables**: `dim_factories`, `dim_models`, `dim_devices`
- **Bronze Tables**: `sensor_bronze`, `inspection_bronze`
- **Silver Tables**: `anomaly_detected`, `inspection_silver`
- **Gold Table**: `inspection_gold`

## 2. Exploring Database Objects <a id="database-objects"></a>

Now let's use SQL to programmatically explore our database objects. You can discover catalogs, schemas, tables, and their metadata using `SHOW` and `DESCRIBE` commands.


In [0]:
# Show all tables in your schema
spark.sql(f"SHOW TABLES IN {CATALOG}.{SCHEMA}").display()
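
To check the `SHOW TABLES` output against the table list in the prerequisites, it can help to group table names by layer. The sketch below assumes the naming convention visible in this dataset (a `dim_` prefix, or a `_bronze`, `_silver`, or `_gold` suffix). `layer_of` and `group_by_layer` are hypothetical helpers, and names that do not follow the convention, such as `anomaly_detected`, fall into `other`.

```python
from collections import defaultdict


def layer_of(table_name: str) -> str:
    """Guess a table's layer from its name (a naming-convention assumption)."""
    if table_name.startswith("dim_"):
        return "dimension"
    for layer in ("bronze", "silver", "gold"):
        if table_name.endswith(f"_{layer}"):
            return layer
    return "other"


def group_by_layer(table_names):
    """Return {layer: [table names]} with the names sorted alphabetically."""
    groups = defaultdict(list)
    for name in sorted(table_names):
        groups[layer_of(name)].append(name)
    return dict(groups)


# Runs only in a Databricks notebook where spark, CATALOG, and SCHEMA exist.
if all(name in globals() for name in ("spark", "CATALOG", "SCHEMA")):
    rows = spark.sql(f"SHOW TABLES IN {CATALOG}.{SCHEMA}").collect()
    for layer, names in group_by_layer(r["tableName"] for r in rows).items():
        print(f"{layer}: {', '.join(names)}")
```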


### Describe Table Schema

Use `DESCRIBE TABLE` to see column names, data types, and comments:


In [0]:
# Describe the sensor_bronze table
spark.sql(f"DESCRIBE TABLE {CATALOG}.{SCHEMA}.sensor_bronze").display()


### Get Extended Table Information

Use `DESCRIBE TABLE EXTENDED` to see detailed metadata including location, provider, and properties:


In [0]:
# Get extended information about the table
spark.sql(f"DESCRIBE TABLE EXTENDED {CATALOG}.{SCHEMA}.sensor_bronze").display()


### View Table History (Delta Lake Feature)

Delta Lake records every change to a table in a transaction log. `DESCRIBE HISTORY` reads that log and returns one row per table version, newest first:


In [0]:
# View table history (shows all operations: CREATE, WRITE, MERGE, etc.)
spark.sql(f"DESCRIBE HISTORY {CATALOG}.{SCHEMA}.sensor_bronze").display()
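
The history output can also be summarized in Python, for example to see which operation types dominate. This is a minimal sketch: `count_operations` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the real table.

```python
from collections import Counter


def count_operations(history_rows):
    """Count history entries per operation type."""
    return Counter(row["operation"] for row in history_rows)


# Made-up rows for illustration only; real rows come from DESCRIBE HISTORY.
sample_rows = [
    {"version": 0, "operation": "CREATE TABLE"},
    {"version": 1, "operation": "WRITE"},
    {"version": 2, "operation": "WRITE"},
]
print(count_operations(sample_rows))

# Runs only in a Databricks notebook where spark, CATALOG, and SCHEMA exist.
if all(name in globals() for name in ("spark", "CATALOG", "SCHEMA")):
    history = spark.sql(f"DESCRIBE HISTORY {CATALOG}.{SCHEMA}.sensor_bronze")
    print(count_operations(history.select("version", "operation").collect()))
```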


### Preview Sample Data

Let's preview the data in our tables:


In [0]:
# Preview dimension tables
print("Factories:")
spark.table(f"{CATALOG}.{SCHEMA}.dim_factories").display()

print("\nModels:")
spark.table(f"{CATALOG}.{SCHEMA}.dim_models").display()

print("\nDevices (sample):")
spark.table(f"{CATALOG}.{SCHEMA}.dim_devices").limit(10).display()

In [0]:
# Preview sensor readings
print("Sample Sensor Readings:")
spark.table(f"{CATALOG}.{SCHEMA}.sensor_bronze").limit(10).display()

## 3. Exploring Files in Volumes <a id="volumes"></a>

Unity Catalog **Volumes** provide governed access to non-tabular files in cloud object storage. The setup notebook created volumes for storing raw data files.

### List Volumes


In [0]:
# Show all volumes in the schema
spark.sql(f"SHOW VOLUMES IN {CATALOG}.{SCHEMA}").display()

### Describe a Volume


In [0]:
# Get details about a specific volume
spark.sql(f"DESCRIBE VOLUME {CATALOG}.{SCHEMA}.sensor_data").display()

### List Files in a Volume

Use the SQL `LIST` command or Databricks Utilities (`dbutils.fs.ls`) to explore files:


In [0]:
# List files in sensor_data volume using SQL
spark.sql(f"LIST '/Volumes/{CATALOG}/{SCHEMA}/sensor_data/'").display()

In [0]:
# List files in inspection_data volume using dbutils
files = dbutils.fs.ls(f"/Volumes/{CATALOG}/{SCHEMA}/inspection_data/")
for file in files:
    print(f"  {file.name} - {file.size:,} bytes")
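
Raw byte counts are hard to compare at a glance. The sketch below adds a total and a human-readable size to a listing. `human_size`, `total_size`, and `FileStub` are hypothetical names, and `FileStub` only mimics the `name` and `size` attributes used in the cell above. Directories are reported with a size of 0, so this counts only files at the top level of the volume.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by dbutils.fs.ls.
FileStub = namedtuple("FileStub", ["name", "size"])


def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


def total_size(files) -> int:
    """Sum the size attribute over a listing."""
    return sum(f.size for f in files)


# Made-up entries for illustration only.
sample_files = [FileStub("a.csv", 1000), FileStub("b.csv", 536)]
print(human_size(total_size(sample_files)))

# Runs only in a Databricks notebook where dbutils, CATALOG, and SCHEMA exist.
if all(name in globals() for name in ("dbutils", "CATALOG", "SCHEMA")):
    files = dbutils.fs.ls(f"/Volumes/{CATALOG}/{SCHEMA}/inspection_data/")
    print(f"{len(files)} entries, {human_size(total_size(files))} in total")
```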

### Exercise: Explore Volumes in Catalog Explorer

1. In Catalog Explorer, navigate to your schema
2. Click on **Volumes** to expand the list
3. Click on `sensor_data` volume
4. In the **Details** tab, you can see:
   - Volume type
   - Storage location
   - Owner information
5. Browse the files within the volume


## 4. Viewing Entity Relationships <a id="entity-relationships"></a>

Our IoT dataset uses a **star schema** with primary and foreign key relationships. Databricks Catalog Explorer can visualize these relationships using an **Entity Relationship Diagram (ERD)**.

### Understanding Our Data Model

**Dimension Tables (with PRIMARY KEYs):**
- `dim_factories` - Factory reference data, PK(`factory_id`)
- `dim_models` - Device model reference data, PK(`model_id`)
- `dim_devices` - Device master data, PK(`device_id`), FK→factories, FK→models

**Fact Tables (with FOREIGN KEYs):**
- `sensor_bronze` - IoT sensor readings, FK→devices, FK→factories, FK→models
- `inspection_bronze` - Inspection records, FK→devices


### View Primary and Foreign Keys
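
Declared constraints can be listed in more than one way. As a sketch, assuming the catalog is governed by Unity Catalog and exposes an `information_schema` that you are allowed to read, the `table_constraints` view returns one row per declared constraint. `constraints_query` is a hypothetical helper that only builds the SQL text. The `DESCRIBE TABLE EXTENDED` cells in this section offer another view of the same table metadata.

```python
def constraints_query(catalog: str, schema: str, table: str) -> str:
    """Build a query listing the constraints declared on one table."""
    return (
        "SELECT constraint_name, constraint_type\n"
        f"FROM {catalog}.information_schema.table_constraints\n"
        f"WHERE table_schema = '{schema}'\n"
        f"  AND table_name = '{table}'\n"
        "ORDER BY constraint_type, constraint_name"
    )


# Example values mirroring the configuration cell at the top of the notebook.
print(constraints_query("default", "db_crash_course", "dim_devices"))

# Runs only in a Databricks notebook where spark, CATALOG, and SCHEMA exist.
if all(name in globals() for name in ("spark", "CATALOG", "SCHEMA")):
    spark.sql(constraints_query(CATALOG, SCHEMA, "dim_devices")).show(truncate=False)
```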


In [0]:
# Show constraints on dim_devices table
# This table has a PRIMARY KEY and two FOREIGN KEYs
spark.sql(f"""
    DESCRIBE TABLE EXTENDED {CATALOG}.{SCHEMA}.dim_devices
""").filter("col_name LIKE '%Constraint%'").display()

In [0]:
# Show constraints on sensor_bronze table
# This table has FOREIGN KEYs to dimension tables
spark.sql(f"""
    DESCRIBE TABLE EXTENDED {CATALOG}.{SCHEMA}.sensor_bronze
""").filter("col_name LIKE '%Constraint%'").display()


### Exercise: View the Entity Relationship Diagram

**To view the ERD in Catalog Explorer:**

1. Navigate to Catalog Explorer
2. Select one of the tables with foreign keys (e.g., `sensor_bronze` or `dim_devices`)
3. Click on the **Columns** tab
4. Click the **View relationships** button (top-right)
5. The ERD will display showing:
   - Primary key columns (with key icon)
   - Foreign key relationships (with connecting lines)
   - Related tables in your schema

The ERD shows how tables connect through their keys, so you can read the data model at a glance.

**Reference:** [Entity Relationship Diagram Documentation](https://docs.databricks.com/aws/en/catalog-explorer/entity-relationship-diagram)


### Query with Joins

The declared keys document which columns to join on, so enriching fact rows with dimension attributes is straightforward:


In [0]:
# Join sensor data with dimension tables to get enriched information
query = f"""
SELECT 
    s.device_id,
    f.factory_name,
    f.region,
    m.model_name,
    m.model_category,
    s.timestamp,
    s.temperature,
    s.rotation_speed,
    s.air_pressure
FROM {CATALOG}.{SCHEMA}.sensor_bronze s
JOIN {CATALOG}.{SCHEMA}.dim_devices d ON s.device_id = d.device_id
JOIN {CATALOG}.{SCHEMA}.dim_factories f ON s.factory_id = f.factory_id
JOIN {CATALOG}.{SCHEMA}.dim_models m ON s.model_id = m.model_id
ORDER BY s.timestamp DESC
LIMIT 20
"""

spark.sql(query).display()
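
An inner join silently drops fact rows that have no matching dimension row, and primary and foreign key constraints on Databricks are informational rather than enforced, so such rows can exist. A minimal sketch of a check, using a `LEFT ANTI JOIN` to count fact rows whose key is missing from the dimension table. `orphan_count_query` is a hypothetical helper that only builds the SQL text.

```python
def orphan_count_query(fact_table: str, dim_table: str, key: str) -> str:
    """Build a query counting fact rows with no match in the dimension table."""
    return (
        "SELECT COUNT(*) AS orphan_rows\n"
        f"FROM {fact_table} AS f\n"
        f"LEFT ANTI JOIN {dim_table} AS d ON f.{key} = d.{key}"
    )


# Example values mirroring the configuration cell at the top of the notebook.
print(orphan_count_query(
    "default.db_crash_course.sensor_bronze",
    "default.db_crash_course.dim_devices",
    "device_id",
))

# Runs only in a Databricks notebook where spark, CATALOG, and SCHEMA exist.
if all(name in globals() for name in ("spark", "CATALOG", "SCHEMA")):
    fact = f"{CATALOG}.{SCHEMA}.sensor_bronze"
    for dim, key in [("dim_devices", "device_id"),
                     ("dim_factories", "factory_id"),
                     ("dim_models", "model_id")]:
        print(dim)
        spark.sql(orphan_count_query(fact, f"{CATALOG}.{SCHEMA}.{dim}", key)).show()
```

A count of zero for every dimension means the inner joins above lose no sensor readings.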

## 5. Understanding Table Insights <a id="table-insights"></a>

Unity Catalog tracks **table usage metadata** for the past 30 days. You can see:
- **Frequent queries** - Most common queries accessing the table
- **Top users** - Users who access the table most often
- **Query patterns** - How the data is being used

This helps you understand:
- Which tables are most valuable
- Who your data consumers are
- How to optimize frequently-run queries


### Exercise: View Table Insights in Catalog Explorer

**To view insights for a table:**

1. Open Catalog Explorer
2. Navigate to a table (e.g., `sensor_bronze`)
3. Click on the **Insights** tab
4. You'll see:
   - **Frequent queries** section showing most-run queries on this table
   - **Top users** section showing who accesses this table
   - Query frequency over time

**Note:** Since this is a new dataset, you may not see much activity yet. Table insights populate as the table is queried over time.

**Reference:** [Table Insights Documentation](https://docs.databricks.com/aws/en/discover/table-insights)


### Additional Exploration: Search for Tables

The search bar at the top of the Databricks workspace supports several ways to find tables:

1. **Keyword Search**: Find tables by name
2. **Semantic Search**: Search for concepts (e.g., "sensor temperature") to find relevant datasets
3. **Column-level Search**: Results match on column names and comments, not only table names

**Exercise:**
1. Click the search bar at the top of the Databricks UI
2. Search for "sensor" - you should see `sensor_bronze` and related tables
3. Try searching for "factory" or "inspection"

Search returns only tables you have permission to see. It matches against:
- Table names
- Column names  
- Table comments
- Column comments


## Summary

In this notebook, you learned how to:

✅ **Discover data** using Catalog Explorer's UI-based tools  
✅ **Explore database objects** programmatically with SQL commands (`SHOW`, `DESCRIBE`, `LIST`)  
✅ **Work with Unity Catalog Volumes** to explore files in cloud storage  
✅ **Visualize entity relationships** using ERDs to understand how tables connect  
✅ **View table insights** to understand data usage patterns  

### Key Takeaways:

1. **Catalog Explorer** provides a unified UI for data discovery and governance
2. **SQL commands** like `SHOW TABLES`, `DESCRIBE TABLE`, and `DESCRIBE HISTORY` enable programmatic exploration
3. **Volumes** provide secure, managed access to files in cloud object storage
4. **Entity Relationship Diagrams** visualize primary/foreign key relationships
5. **Table Insights** show query patterns and usage metrics

**Additional Resources:**
- [Discover Data](https://docs.databricks.com/aws/en/discover/)
- [Catalog Explorer](https://docs.databricks.com/aws/en/catalog-explorer/)
- [Unity Catalog](https://docs.databricks.com/aws/en/data-governance/unity-catalog/)
