# Prepare and transform data in the lakehouse (Spark SQL)

This is a companion notebook for the Microsoft Learn tutorial: https://learn.microsoft.com/fabric/data-engineering/tutorial-lakehouse-data-preparation
The tutorial article follows Path 1 of this notebook.

This notebook includes two execution paths:
- Path 1: Lakehouse schemas enabled (`Tables/dbo/...`). This is the path the tutorial article supports.
- Path 2: Lakehouse schemas not enabled (`wwilakehouse....`). Use this alternate path when schemas are not enabled in your environment.

## Path 1 - Lakehouse schemas enabled (tutorial-supported path)
### Create Delta tables
Run these cells to create Delta tables from raw data using schema-qualified paths (`Tables/dbo/...`).

#### Cell 1 - Spark session configuration
This cell sets three session properties that affect how subsequent cells write and read data. The first enables V-Order, which optimizes the Parquet file layout for faster reads and better compression. The second enables Optimize Write, which reduces the number of files written and increases individual file size. The third sets the Optimize Write bin size to 1073741824 bytes (1 GiB).

In [None]:
%%sql
SET spark.sql.parquet.vorder.enabled=true;
SET spark.microsoft.delta.optimizeWrite.enabled=true;
SET spark.microsoft.delta.optimizeWrite.binSize=1073741824;
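
To confirm that the session picked up these settings, you can read them back. This is an optional sketch, not part of the tutorial: in Spark SQL, `SET` followed by a key and no value returns the current value of that key. Depending on how your notebook renders multi-statement cells, you may see only the last result, in which case run the statements one at a time.

```sql
%%sql
-- Optional check: SET with a key and no value returns the current session value.
SET spark.sql.parquet.vorder.enabled;
SET spark.microsoft.delta.optimizeWrite.enabled;
SET spark.microsoft.delta.optimizeWrite.binSize;
```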

#### Cell 2 - Fact - Sale
This cell reads raw parquet data from `Files/wwi-raw-data/full/fact_sale_1y_full`, adds date part columns (`Year`, `Quarter`, and `Month`), and writes `fact_sale` as a Delta table partitioned by `Year` and `Quarter`.

In [None]:
%%sql
CREATE OR REPLACE TABLE delta.`Tables/dbo/fact_sale`
USING DELTA
PARTITIONED BY (Year, Quarter)
AS
SELECT
   *,
   year(InvoiceDateKey) AS Year,
   quarter(InvoiceDateKey) AS Quarter,
   month(InvoiceDateKey) AS Month
FROM parquet.`Files/wwi-raw-data/full/fact_sale_1y_full`;
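
After the table is written, a quick way to sanity-check the derived columns and the partitioning is to count rows per `Year` and `Quarter`. The query below is an optional sketch that assumes Cell 2 finished successfully; the alias names (`row_count`, `first_invoice_date`, `last_invoice_date`) are illustrative only.

```sql
%%sql
-- Optional check: one row per (Year, Quarter) partition, with the date range it covers.
SELECT
    Year,
    Quarter,
    COUNT(*) AS row_count,
    MIN(InvoiceDateKey) AS first_invoice_date,
    MAX(InvoiceDateKey) AS last_invoice_date
FROM delta.`Tables/dbo/fact_sale`
GROUP BY Year, Quarter
ORDER BY Year, Quarter;
```

Each returned date range should fall inside the year and quarter shown on the same row.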

#### Cell 3 - Dimensions
This cell reads the five dimension parquet datasets and writes them as Delta tables (`dimension_city`, `dimension_customer`, `dimension_date`, `dimension_employee`, and `dimension_stock_item`) under `Tables/dbo/...`.

In [None]:
%%sql
CREATE OR REPLACE TABLE delta.`Tables/dbo/dimension_city` USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_city`;
CREATE OR REPLACE TABLE delta.`Tables/dbo/dimension_customer` USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_customer`;
CREATE OR REPLACE TABLE delta.`Tables/dbo/dimension_date` USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_date`;
CREATE OR REPLACE TABLE delta.`Tables/dbo/dimension_employee` USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_employee`;
CREATE OR REPLACE TABLE delta.`Tables/dbo/dimension_stock_item` USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_stock_item`;
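
To verify that all five dimension tables were created and contain data, you can count their rows in a single result set. This is an optional sketch that assumes Cell 3 completed; the column aliases `table_name` and `row_count` are illustrative.

```sql
%%sql
-- Optional check: row count for each dimension table written by Cell 3.
SELECT 'dimension_city' AS table_name, COUNT(*) AS row_count FROM delta.`Tables/dbo/dimension_city`
UNION ALL
SELECT 'dimension_customer', COUNT(*) FROM delta.`Tables/dbo/dimension_customer`
UNION ALL
SELECT 'dimension_date', COUNT(*) FROM delta.`Tables/dbo/dimension_date`
UNION ALL
SELECT 'dimension_employee', COUNT(*) FROM delta.`Tables/dbo/dimension_employee`
UNION ALL
SELECT 'dimension_stock_item', COUNT(*) FROM delta.`Tables/dbo/dimension_stock_item`;
```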

### Transform data for business aggregates
Run the transformation cells to create aggregate outputs for reporting in the schema-enabled path.

#### Cell 4 - Load source tables
For Spark SQL, no additional load step is required in this notebook. The source Delta tables are used directly in the SQL statements below.

#### Cell 5 - Aggregate sale by date and city
This cell computes monthly sales totals by city and materializes the result as `aggregate_sale_by_date_city` using `CREATE OR REPLACE TABLE ... AS SELECT`.

In [None]:
%%sql
CREATE OR REPLACE TABLE delta.`Tables/dbo/aggregate_sale_by_date_city`
AS
SELECT
    fs.Year,
    fs.Month,
    c.City,
    c.StateProvince,
    c.SalesTerritory,
    SUM(fs.TotalExcludingTax) AS sum_of_total_excluding_tax,
    SUM(fs.TaxAmount) AS sum_of_tax_amount,
    SUM(fs.Profit) AS sum_of_profit
FROM delta.`Tables/dbo/fact_sale` fs
INNER JOIN delta.`Tables/dbo/dimension_city` c
    ON fs.CityKey = c.CityKey
GROUP BY
    fs.Year,
    fs.Month,
    c.City,
    c.StateProvince,
    c.SalesTerritory;
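
Because the aggregate is built with an `INNER JOIN`, any `fact_sale` row whose `CityKey` has no match in `dimension_city` is dropped, and a duplicated `CityKey` in the dimension would inflate the totals. One way to check for either case is to compare total profit in the fact table with total profit in the aggregate. This is an optional sketch; the aliases `fact_profit` and `aggregate_profit` are illustrative.

```sql
%%sql
-- Optional check: the two totals should agree if every fact row joined to exactly one city.
SELECT
    f.fact_profit,
    a.aggregate_profit
FROM (
    SELECT SUM(Profit) AS fact_profit
    FROM delta.`Tables/dbo/fact_sale`
) f
CROSS JOIN (
    SELECT SUM(sum_of_profit) AS aggregate_profit
    FROM delta.`Tables/dbo/aggregate_sale_by_date_city`
) a;
```

If the column type is floating point, a small difference between the two values may be rounding rather than missing rows.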

#### Cell 6 - Aggregate sale by date and employee
This cell computes monthly sales totals by employee and writes the result to `aggregate_sale_by_date_employee` using `CREATE OR REPLACE TABLE ... AS SELECT`.

In [None]:
%%sql
CREATE OR REPLACE TABLE delta.`Tables/dbo/aggregate_sale_by_date_employee`
AS
SELECT
    fs.Year,
    fs.Month,
    e.Employee,
    e.IsSalesperson,
    SUM(fs.TotalExcludingTax) AS sum_of_total_excluding_tax,
    SUM(fs.TaxAmount) AS sum_of_tax_amount,
    SUM(fs.Profit) AS sum_of_profit
FROM delta.`Tables/dbo/fact_sale` fs
INNER JOIN delta.`Tables/dbo/dimension_employee` e
    ON fs.SalespersonKey = e.EmployeeKey
GROUP BY
    fs.Year,
    fs.Month,
    e.Employee,
    e.IsSalesperson;
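
To preview the result, you can list the highest-profit employees for each month. The query below is an optional sketch that assumes Cell 6 completed; the `LIMIT` of 20 is an arbitrary choice to keep the output short.

```sql
%%sql
-- Optional preview: monthly rows from the employee aggregate, highest profit first within each month.
SELECT
    Year,
    Month,
    Employee,
    sum_of_total_excluding_tax,
    sum_of_profit
FROM delta.`Tables/dbo/aggregate_sale_by_date_employee`
ORDER BY Year, Month, sum_of_profit DESC
LIMIT 20;
```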

## Path 2 - Lakehouse schemas not enabled (alternate path)
### Create Delta tables
Run these cells to create Delta tables from raw data using non-schema table names (`wwilakehouse....`).

#### Cell 1 - Spark session configuration
This cell sets three session properties that affect how subsequent cells write and read data. The first enables V-Order, which optimizes the Parquet file layout for faster reads and better compression. The second enables Optimize Write, which reduces the number of files written and increases individual file size. The third sets the Optimize Write bin size to 1073741824 bytes (1 GiB).

In [None]:
%%sql
SET spark.sql.parquet.vorder.enabled=true;
SET spark.microsoft.delta.optimizeWrite.enabled=true;
SET spark.microsoft.delta.optimizeWrite.binSize=1073741824;

#### Cell 2 - Fact - Sale
This cell reads raw parquet data from `Files/wwi-raw-data/full/fact_sale_1y_full`, adds date part columns (`Year`, `Quarter`, and `Month`), and writes `wwilakehouse.fact_sale` as a Delta table partitioned by `Year` and `Quarter`.

In [None]:
%%sql
CREATE OR REPLACE TABLE wwilakehouse.fact_sale
USING DELTA
PARTITIONED BY (Year, Quarter)
AS
SELECT
   *,
   year(InvoiceDateKey) AS Year,
   quarter(InvoiceDateKey) AS Quarter,
   month(InvoiceDateKey) AS Month
FROM parquet.`Files/wwi-raw-data/full/fact_sale_1y_full`;
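
To confirm how the table was laid out on disk, you can inspect its Delta metadata. This is an optional sketch using the Delta Lake `DESCRIBE DETAIL` command, which returns a single row that includes fields such as `partitionColumns`, `numFiles`, and `sizeInBytes`.

```sql
%%sql
-- Optional check: partitionColumns should list Year and Quarter.
DESCRIBE DETAIL wwilakehouse.fact_sale;
```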

#### Cell 3 - Dimensions
This cell reads the five dimension parquet datasets and writes them as Delta tables (`dimension_city`, `dimension_customer`, `dimension_date`, `dimension_employee`, and `dimension_stock_item`) under `wwilakehouse....`.

In [None]:
%%sql
CREATE OR REPLACE TABLE wwilakehouse.dimension_city USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_city`;
CREATE OR REPLACE TABLE wwilakehouse.dimension_customer USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_customer`;
CREATE OR REPLACE TABLE wwilakehouse.dimension_date USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_date`;
CREATE OR REPLACE TABLE wwilakehouse.dimension_employee USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_employee`;
CREATE OR REPLACE TABLE wwilakehouse.dimension_stock_item USING DELTA AS SELECT * FROM parquet.`Files/wwi-raw-data/full/dimension_stock_item`;
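
To confirm that the tables are registered under the lakehouse name, you can list them. This optional sketch assumes the lakehouse is named `wwilakehouse`, as in the statements above.

```sql
%%sql
-- Optional check: fact_sale and the five dimension tables should appear in the list.
SHOW TABLES IN wwilakehouse;
```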

### Transform data for business aggregates
Run the transformation cells to create aggregate outputs for reporting in the non-schema path.

#### Cell 4 - Load source tables
No separate load step is needed in Path 2 either: the Spark SQL statements below query the `wwilakehouse` Delta tables directly.

#### Cell 5 - Aggregate sale by date and city
This cell computes monthly sales totals by city and writes the result to `wwilakehouse.aggregate_sale_by_date_city`.

In [None]:
%%sql
CREATE OR REPLACE TABLE wwilakehouse.aggregate_sale_by_date_city
AS
SELECT
    fs.Year,
    fs.Month,
    c.City,
    c.StateProvince,
    c.SalesTerritory,
    SUM(fs.TotalExcludingTax) AS sum_of_total_excluding_tax,
    SUM(fs.TaxAmount) AS sum_of_tax_amount,
    SUM(fs.Profit) AS sum_of_profit
FROM wwilakehouse.fact_sale fs
INNER JOIN wwilakehouse.dimension_city c
    ON fs.CityKey = c.CityKey
GROUP BY
    fs.Year,
    fs.Month,
    c.City,
    c.StateProvince,
    c.SalesTerritory;
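
Because the cell uses `CREATE OR REPLACE TABLE`, rerunning it replaces the table contents rather than appending to them. If you want to see this, the Delta Lake `DESCRIBE HISTORY` command lists the table's versions; each rerun of the cell should add one. This is an optional sketch, not part of the tutorial.

```sql
%%sql
-- Optional check: one history row per table version, newest first.
DESCRIBE HISTORY wwilakehouse.aggregate_sale_by_date_city;
```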

#### Cell 6 - Aggregate sale by date and employee
This cell computes monthly sales totals by employee and writes the result to `wwilakehouse.aggregate_sale_by_date_employee`.

In [None]:
%%sql
CREATE OR REPLACE TABLE wwilakehouse.aggregate_sale_by_date_employee
AS
SELECT
    fs.Year,
    fs.Month,
    e.Employee,
    e.IsSalesperson,
    SUM(fs.TotalExcludingTax) AS sum_of_total_excluding_tax,
    SUM(fs.TaxAmount) AS sum_of_tax_amount,
    SUM(fs.Profit) AS sum_of_profit
FROM wwilakehouse.fact_sale fs
INNER JOIN wwilakehouse.dimension_employee e
    ON fs.SalespersonKey = e.EmployeeKey
GROUP BY
    fs.Year,
    fs.Month,
    e.Employee,
    e.IsSalesperson;
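
The aggregate table can be queried like any other table in the lakehouse. As an illustration, the optional sketch below rolls the monthly rows up to a total per employee; the alias `total_profit` and the `LIMIT` of 10 are arbitrary choices.

```sql
%%sql
-- Optional example: total profit per employee across all months, highest first.
SELECT
    Employee,
    SUM(sum_of_profit) AS total_profit
FROM wwilakehouse.aggregate_sale_by_date_employee
GROUP BY Employee
ORDER BY total_profit DESC
LIMIT 10;
```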