# Lakehouse Federation: Querying External Data

## Introduction
**Lakehouse Federation** is a Databricks capability that lets you query data residing in external databases (such as PostgreSQL, MySQL, Snowflake, Redshift, and SQL Server) **without moving or copying the data** into Databricks.

### Key Benefits:
1.  **Unified Interface:** Access external data using standard Spark/Databricks SQL.
2.  **No Data Movement:** Query siloed data in place, without building ETL pipelines just for ad-hoc analysis.
3.  **Unified Governance:** Apply Unity Catalog permissions and lineage tracking to external data sources.

In this notebook, we will learn how to set up Lakehouse Federation using **Connections** and **Foreign Catalogs**.

## The Architecture
To query external data, Unity Catalog uses a two-level object structure:

1.  **Connection:** A secured object that stores the details to connect to the external system (Host, Port, User, Password/Secret).
2.  **Foreign Catalog:** A catalog inside Unity Catalog that mirrors the database structure of the external system.

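Once set up, the external tables appear inside the Foreign Catalog and can be queried with the usual three-level `catalog.schema.table` names, just like standard Delta tables. Federated queries are read-only: they read from the external system but do not write back to it.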

## Step 1: Create a Connection
First, we need to define how Databricks connects to the external source (e.g., PostgreSQL).

*Note: In a production environment, never hardcode passwords. Use **Databricks Secrets** (e.g., `secret('scope', 'key')`).*

**Prerequisite:** You need `CREATE CONNECTION` privileges on the Metastore.
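
A metastore admin could grant this privilege along the following lines. This is a sketch; `federation_admins` is a hypothetical group name, not one defined in this notebook.

```sql
-- Run as a metastore admin.
-- `federation_admins` is a hypothetical group; replace it with a real user or group.
GRANT CREATE CONNECTION ON METASTORE TO `federation_admins`;
```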

In [None]:
-- Template to create a connection to a PostgreSQL database
-- Replace the values with your actual database details

CREATE CONNECTION IF NOT EXISTS pgsql_connection
TYPE postgresql
OPTIONS (
  host 'postgres-host-url.com',
  port '5432',
  user 'postgres_user',
  password 'your_password' -- Better practice: secret('my_scope', 'pg_password')
);
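
The secret-based variant recommended in the note above might look like the following sketch. The connection name `pgsql_connection_secure`, the scope `my_scope`, and the keys `pg_user` and `pg_password` are assumptions for illustration; the scope and keys must already exist in Databricks Secrets.

```sql
-- Hypothetical names: pgsql_connection_secure, my_scope, pg_user, pg_password
CREATE CONNECTION IF NOT EXISTS pgsql_connection_secure
TYPE postgresql
OPTIONS (
  host 'postgres-host-url.com',
  port '5432',
  user secret('my_scope', 'pg_user'),         -- resolved from Databricks Secrets
  password secret('my_scope', 'pg_password')  -- no credential appears in the notebook
);

-- Inspect the connection metadata after creating it
DESCRIBE CONNECTION pgsql_connection_secure;
```

Keeping both the user and the password in secrets means the notebook can be shared or version-controlled without exposing credentials.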

## Step 2: Create a Foreign Catalog
Now that the connection is defined, we map a specific database inside that external system to a **Catalog** in Databricks.

Tables in that database that the connection's user is allowed to read become visible under this catalog, without being registered one by one.

In [None]:
-- Create a Foreign Catalog that mirrors the 'demo' database in Postgres
CREATE FOREIGN CATALOG IF NOT EXISTS pgsql_aws_catalog
USING CONNECTION pgsql_connection
OPTIONS (database 'demo');
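
To confirm that the catalog was created, something like the following can be run. This is a sketch only; the exact output columns depend on the workspace and runtime version.

```sql
-- List catalogs whose names start with 'pgsql'
SHOW CATALOGS LIKE 'pgsql*';

-- Show details of the foreign catalog created above
DESCRIBE CATALOG EXTENDED pgsql_aws_catalog;
```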

## Step 3: Explore External Data
Once the catalog is created, you can browse it just like a local catalog. The schema and tables are fetched dynamically.

In [None]:
-- List schemas in the foreign catalog
SHOW SCHEMAS IN pgsql_aws_catalog;

In [None]:
-- List tables in the public schema of the foreign catalog
SHOW TABLES IN pgsql_aws_catalog.public;
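
If a table was recently added or altered in Postgres and is not yet visible, the metadata held by Unity Catalog can be refreshed explicitly. The sketch below assumes the `employees` table used later in this notebook and a runtime that supports `REFRESH FOREIGN`.

```sql
-- Re-sync schema and table metadata from the external database
REFRESH FOREIGN CATALOG pgsql_aws_catalog;

-- Inspect the columns and types that Databricks sees for one foreign table
DESCRIBE TABLE EXTENDED pgsql_aws_catalog.public.employees;
```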

## Step 4: Query and Join (Federation in Action)
The real power of Lakehouse Federation is that **External Data** (e.g., in Postgres) can be queried and joined with Databricks SQL, both with other external tables, as in the scenario below, and with **Native Data** (Delta Tables in Databricks).

**Scenario:**
*   We have `employees` and `department` tables living in AWS RDS (Postgres).
*   We want to join them to calculate employee counts per department using Databricks SQL.

In [None]:
-- Simple Select from the external table
SELECT * FROM pgsql_aws_catalog.public.employees LIMIT 10;
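
For anything beyond a quick preview, it usually helps to select only the needed columns and filter early, so that less data has to leave Postgres. The sketch below uses `EXPLAIN FORMATTED` to inspect the plan and see which parts of the query were pushed down to the external database. The filter value `10` is an assumption, as is the integer type of `department_id`.

```sql
-- Inspect the query plan for a filtered read of the foreign table.
-- The literal 10 is a hypothetical department id.
EXPLAIN FORMATTED
SELECT department_id
FROM pgsql_aws_catalog.public.employees
WHERE department_id = 10;
```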

In [None]:
-- Federated Join Query
-- Joins two tables that physically reside in Postgres; the query is issued from Databricks
-- Employees without a matching department are counted under a NULL department_name
SELECT
  d.department_name,
  COUNT(1) AS emp_count
FROM pgsql_aws_catalog.public.employees AS e
LEFT OUTER JOIN pgsql_aws_catalog.public.department AS d
  ON e.department_id = d.department_id
GROUP BY d.department_name
ORDER BY emp_count DESC;
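
The query above joins two Postgres tables. Joining external data with a native Delta table follows the same pattern, as in this sketch. The table `main.hr.department_budgets` and its columns `department_id` and `annual_budget` are hypothetical and stand in for any Delta table you already have.

```sql
-- Join foreign Postgres tables with a native Delta table.
-- main.hr.department_budgets (department_id, annual_budget) is a hypothetical Delta table.
SELECT
  d.department_name,
  b.annual_budget,
  COUNT(e.department_id) AS emp_count  -- counts only rows with a matching employee
FROM pgsql_aws_catalog.public.department AS d
LEFT JOIN pgsql_aws_catalog.public.employees AS e
  ON e.department_id = d.department_id
LEFT JOIN main.hr.department_budgets AS b
  ON b.department_id = d.department_id
GROUP BY d.department_name, b.annual_budget
ORDER BY emp_count DESC;
```

Starting from `department` and counting `e.department_id` keeps departments with no employees in the result, with a count of 0.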

## Step 5: Governance and Lineage
Even though the data lives outside Databricks:

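1.  **Permissions:** You can grant access to the foreign catalog to specific Databricks users or groups using standard SQL. A principal needs `USE CATALOG` and `USE SCHEMA` in addition to `SELECT` before it can read a table.
    ```sql
    GRANT USE CATALOG ON CATALOG pgsql_aws_catalog TO `data_analysts`;
    GRANT USE SCHEMA ON SCHEMA pgsql_aws_catalog.public TO `data_analysts`;
    GRANT SELECT ON SCHEMA pgsql_aws_catalog.public TO `data_analysts`;
    ```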

2.  **Lineage:** If you create a new Databricks table from the query above, Unity Catalog lineage can trace that table back to the foreign tables it was built from.
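
To try the lineage point, the join from Step 4 could be materialized into a managed table, as sketched here. The target `main.hr.department_headcount` is a hypothetical name; its catalog and schema must already exist and you need permission to create tables there.

```sql
-- Materialize the federated join into a Delta table.
-- main.hr.department_headcount is a hypothetical target table.
CREATE OR REPLACE TABLE main.hr.department_headcount AS
SELECT
  d.department_name,
  COUNT(1) AS emp_count
FROM pgsql_aws_catalog.public.employees AS e
LEFT OUTER JOIN pgsql_aws_catalog.public.department AS d
  ON e.department_id = d.department_id
GROUP BY d.department_name;
```

After this runs, the table's lineage view in Catalog Explorer is where the upstream foreign tables would be expected to appear.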

In [None]:
# Cleanup (Optional)
# Uncomment the lines below if you want to remove the resources created in this demo.

# spark.sql("DROP CATALOG IF EXISTS pgsql_aws_catalog CASCADE")
# spark.sql("DROP CONNECTION IF EXISTS pgsql_connection")