## Data Lakehouse
- A Data Lakehouse is a hybrid data architecture that combines the flexibility and scalability of a data lake with the performance and governance features of a data warehouse. 
- It enables organizations to run BI, machine learning (ML), and real-time analytics on a single platform.

### Features
- Supports structured, semi-structured, and unstructured data
- ACID transactions
- Schema enforcement
- Time travel and data versioning
- Batch and streaming workloads
- The underlying storage layer is typically low-cost cloud object storage (for example S3, ADLS, or GCS)

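### Databricks uses Delta Lake
- Delta Lake is the table format behind the Databricks lakehouse; Delta Live Tables (DLT) is the declarative pipeline framework built on top of it.
#### Snowflake uses Apache Iceberg
- With Apache Iceberg tables, Snowflake can query data directly in data lake storage without ingesting it into native Snowflake tables.
- AWS Athena is a separate AWS query service that can also read Iceberg tables stored in S3.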


## Delta Lake
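- Stores data in Parquet, a columnar file format.
- Delta Lake adds a transaction log (`_delta_log`) that tracks all changes to the data and ensures ACID compliance.
- Because every write is an ACID transaction, schema enforcement is possible: a write that does not match the table schema is rejected.
- Time travel is available because the transaction log records every version of the table, so earlier versions can still be queried.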
- Handles table metadata at scale, since the metadata lives in the transaction log alongside the data.
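
The relationship between the transaction log and time travel can be illustrated with a toy model. This is a minimal sketch in plain Python, not Delta Lake's actual implementation: the `commits` list, the file names, and the `snapshot` helper are all hypothetical, and real `_delta_log` entries carry many more fields.

```python
# Toy commit log: one list of actions per table version, loosely modelled on
# the JSON commit files under _delta_log (file names here are made up).
commits = [
    [{"add": "part-000.parquet"}],                                  # version 0
    [{"add": "part-001.parquet"}],                                  # version 1
    [{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}],  # version 2
]

def snapshot(log, version):
    """Replay commits 0..version and return the set of live data files."""
    if not 0 <= version < len(log):
        raise ValueError(f"unknown version {version}")
    files = set()
    for actions in log[: version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files

latest = snapshot(commits, len(commits) - 1)
as_of_v1 = snapshot(commits, 1)  # "time travel" back to version 1
print(sorted(latest))
print(sorted(as_of_v1))
```

Because old commits are never rewritten, reading an earlier version only means replaying a shorter prefix of the log.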

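## Declarative vs procedural 
- A procedural pipeline spells out every step and its execution order; a declarative pipeline (Delta Live Tables) only defines the target tables, and the framework works out how to build them.
- With the declarative approach, DLT provides:
  - built-in data quality checks
  - automation of the medallion architecture (bronze, silver, gold)
  - monitoring and lineage tracking
  - automatic performance optimization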
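
The declarative idea can be sketched without any Databricks dependency: the tables only state what they read from, and a generic resolver derives the run order. This is an illustration under assumptions, not how DLT is implemented; the table names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Declarative spec: each table names only the tables it reads from.
# Table names are hypothetical and follow a bronze -> silver -> gold layout.
tables = {
    "bronze_sales": [],
    "silver_sales": ["bronze_sales"],
    "gold_daily_revenue": ["silver_sales"],
    "gold_top_customers": ["silver_sales"],
}

def execution_order(spec):
    """Derive a valid run order from the declared dependencies."""
    return list(TopologicalSorter(spec).static_order())

order = execution_order(tables)
print(order)
```

In a procedural pipeline the author would have to maintain this order by hand; here, adding a new table only requires declaring its inputs.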

### Components
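- Delta table - the table that stores the data, as Parquet files.
- Delta log - the transaction log that records every change to the table.
- Delta Cache - a Databricks feature (also called the disk cache) that keeps copies of remote data files on the cluster's local disks to speed up repeated reads.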

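- Upserts are done via `MERGE INTO`; a plain `INSERT INTO` only appends rows.
- Checkpoints periodically summarize the transaction log, so the table state can be recovered without replaying every commit.
- Earlier table versions (snapshots) can be used for rollback, for example with `RESTORE TABLE ... TO VERSION AS OF`.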
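
The matched/not-matched logic of an upsert can be sketched in plain Python. This only illustrates the semantics of `MERGE`; the `merge_upsert` helper and the sample rows are hypothetical, and no Delta Lake API is involved.

```python
def merge_upsert(target, updates, key):
    """Return new rows: matching keys are updated, unseen keys are inserted."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in merged:
            merged[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            merged[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return list(merged.values())

target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
updates = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
result = merge_upsert(target, updates, key="id")
print(result)
```

The row with `id` 2 is updated in place and the row with `id` 3 is added, so each key still appears exactly once.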


In [None]:
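-- Live table creation syntax
-- Reading with STREAM() requires a streaming table; expectations are declared as constraints
CREATE OR REFRESH STREAMING LIVE TABLE silver_sales (
  -- Transaction amount must be positive; violating rows are dropped
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.bronze_sales);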
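
An expectation such as `amount > 0` can treat violating rows in three ways in DLT: keep them and only record the metric (the default), drop them (`ON VIOLATION DROP ROW`), or fail the update (`ON VIOLATION FAIL UPDATE`). The sketch below mimics those three behaviours in plain Python; the `apply_expectation` helper and the sample rows are hypothetical and are not part of the DLT API.

```python
def apply_expectation(rows, predicate, on_violation="warn"):
    """Apply one expectation and return (kept_rows, failed_count).

    on_violation: "warn" keeps bad rows, "drop" removes them,
    "fail" raises as soon as any row violates the expectation.
    """
    if on_violation not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown on_violation mode: {on_violation}")
    failed = [r for r in rows if not predicate(r)]
    if failed and on_violation == "fail":
        raise RuntimeError(f"{len(failed)} row(s) violated the expectation")
    if on_violation == "drop":
        kept = [r for r in rows if predicate(r)]
    else:
        kept = list(rows)
    return kept, len(failed)

rows = [{"amount": 12.5}, {"amount": -3.0}, {"amount": 40.0}]
kept, failed = apply_expectation(rows, lambda r: r["amount"] > 0, "drop")
print(kept, failed)
```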


In [None]:
-- Read from another live table
CREATE OR REFRESH LIVE TABLE top_five
AS SELECT * FROM LIVE.silver_sales
LIMIT 5;
-- The TEMPORARY keyword (CREATE TEMPORARY LIVE TABLE ...) creates a table that is not published outside the pipeline

### DLT pipeline via API

In [None]:
# Create the JSON payload and save it as pipeline_config.json (JSON itself does not allow comments)
#   storage:   storage location for pipeline logs
#   target:    target database for the pipeline
#   libraries: notebook path for the transformations
#   continuous: true runs continuously; false makes the pipeline triggered
{
  "name": "DLT_Pipeline_API",
  "storage": "dbfs:/pipelines/dlt_pipeline_api",
  "target": "dlt_target_db",
  "development": false,
  "clusters": [
    {
      "label": "default",
      "num_workers": 2
    }
  ],
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/your_repo/dlt_notebook"
      }
    }
  ],
  "edition": "ADVANCED",
  "photon": true,
  "continuous": true
}


In [None]:
## Create the DLT pipeline using the REST API
## The URL is quoted so the shell does not treat < and > as redirections
curl -X POST "https://<DATABRICKS_WORKSPACE>/api/2.0/pipelines" \
-H "Authorization: Bearer <DATABRICKS_TOKEN>" \
-H "Content-Type: application/json" \
-d @pipeline_config.json

In [None]:
import requests
import json

# Databricks API credentials
DATABRICKS_HOST = "https://<DATABRICKS_WORKSPACE>"
TOKEN = "<DATABRICKS_TOKEN>"

# Pipeline Configuration
pipeline_config = {
    "name": "DLT_Pipeline_API",
    "storage": "dbfs:/pipelines/dlt_pipeline_api",
    "target": "dlt_target_db",
    "development": False,
    "clusters": [{"label": "default", "num_workers": 2}],
    "libraries": [{"notebook": {"path": "/Repos/your_repo/dlt_notebook"}}],
    "edition": "ADVANCED",
    "photon": True,
    "continuous": True
}

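# API call: create the pipeline
headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/pipelines",
    headers=headers,
    data=json.dumps(pipeline_config),
    timeout=30,
)
# Raise on HTTP errors instead of printing an error body as if it were a success
response.raise_for_status()

# Print the response, which contains the new pipeline ID
print(response.json())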


- The response contains the `pipeline_id`, which is used to start and monitor the pipeline.
- The list of all pipelines can be retrieved with curl or Python via `GET /api/2.0/pipelines`.
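
A minimal sketch of the list call in Python follows. It uses only the standard library (`urllib`) rather than `requests` so that it runs anywhere; the helper names, the `example.invalid` host, and the dummy token are placeholders, and the request is built but not sent.

```python
import json
import urllib.request

def build_list_pipelines_request(host, token):
    """Build (but do not send) a GET request for /api/2.0/pipelines."""
    url = f"{host.rstrip('/')}/api/2.0/pipelines"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

def fetch_json(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)

# Placeholder host and token; replace with real workspace values.
req = build_list_pipelines_request("https://example.invalid", "dummy-token")
print(req.get_method(), req.full_url)
```

Calling `fetch_json(req)` against a real workspace would return the parsed response; the exact shape of that response should be checked against the Pipelines API reference.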

In [None]:
## Start the pipeline by requesting a new update
curl -X POST "https://<DATABRICKS_WORKSPACE>/api/2.0/pipelines/<PIPELINE_ID>/updates" \
-H "Authorization: Bearer <DATABRICKS_TOKEN>"


In [None]:
## Status: the pipeline details returned here include its current state
curl -X GET "https://<DATABRICKS_WORKSPACE>/api/2.0/pipelines/<PIPELINE_ID>" \
-H "Authorization: Bearer <DATABRICKS_TOKEN>"
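
Monitoring usually means polling the status call until the pipeline settles. The sketch below shows one way to structure that loop; it takes the status lookup as a callable so it runs without a workspace. The `wait_for_state` helper is hypothetical, and the state names in `SETTLED_STATES` are an assumption to verify against the Pipelines API reference.

```python
import time

# Assumed settled states; verify against the Pipelines API reference.
SETTLED_STATES = {"RUNNING", "IDLE", "FAILED"}

def wait_for_state(get_state, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll get_state() until it returns a settled state or polls run out."""
    state = None
    for attempt in range(max_polls):
        state = get_state()
        if state in SETTLED_STATES:
            return state
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"pipeline still in state {state!r} after {max_polls} polls")

# Stand-in for a real status request, so the sketch runs offline.
fake_states = iter(["DEPLOYING", "STARTING", "RUNNING"])
final_state = wait_for_state(lambda: next(fake_states), sleep=lambda s: None)
print(final_state)
```

In real use, `get_state` would wrap the status request shown above and return the `state` field of the response.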


In [None]:
-- Create a view over the pipeline event log, using the pipeline ID
CREATE VIEW event_log_raw AS SELECT * FROM event_log("<pipeline-ID>");

In [None]:
-- Query lineage information
SELECT
  details:flow_definition.output_dataset AS output_dataset,
  details:flow_definition.input_datasets AS input_dataset
FROM
  event_log_raw
WHERE
  event_type = 'flow_definition';

In [None]:
-- Query data quality information
SELECT
  details:flow_progress.data_quality.expectations
FROM
  event_log_raw  -- view over the pipeline event log (stored in Unity Catalog)
WHERE
  event_type = 'flow_progress';
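
The `expectations` value is a JSON array with one entry per expectation, carrying the dataset, the expectation name, and passed and failed record counts. The sketch below totals those counts across events in plain Python; the sample payloads and the `summarize_expectations` helper are made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical `expectations` JSON strings from two flow_progress events.
sample_events = [
    '[{"name": "positive_amount", "dataset": "silver_sales", '
    '"passed_records": 95, "failed_records": 5}]',
    '[{"name": "positive_amount", "dataset": "silver_sales", '
    '"passed_records": 48, "failed_records": 2}]',
]

def summarize_expectations(events):
    """Sum passed and failed record counts per (dataset, expectation name)."""
    totals = defaultdict(lambda: {"passed": 0, "failed": 0})
    for raw in events:
        for exp in json.loads(raw):
            key = (exp["dataset"], exp["name"])
            totals[key]["passed"] += exp["passed_records"]
            totals[key]["failed"] += exp["failed_records"]
    return dict(totals)

summary = summarize_expectations(sample_events)
print(summary)
```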

In [None]:
-- Get backlog data
SELECT
  timestamp,
  Double(details:flow_progress.metrics.backlog_bytes) AS backlog
FROM
  event_log_raw
WHERE
  event_type = 'flow_progress';

In [None]:
-- When autoscaling is enabled, cluster resize events are also recorded in the event log
SELECT
  details:autoscale
FROM
  event_log_raw
WHERE
  event_type = 'autoscale';

In [None]:
-- Check cluster resources
SELECT
  timestamp,
  Double(details:cluster_resources.avg_num_queued_tasks) AS queue_size
FROM
  event_log_raw
WHERE
  event_type = 'cluster_resources';

-- Other fields available under details:cluster_resources:
-- avg_task_slot_utilization
-- num_executors
-- latest_requested_num_executors
-- optimal_num_executors
-- state

In [None]:
-- User actions
SELECT
  timestamp,
  details:user_action:action,
  details:user_action:user_name
FROM
  event_log_raw
WHERE
  event_type = 'user_action';
