### Learning Objectives
- Objective 1: Use the CTAS statement with `read_files()` to ingest Parquet files into a Delta table
- Objective 2: Use COPY INTO to incrementally load Parquet files from cloud object storage into a Delta table
    - Usage of `COPY_OPTIONS ('mergeSchema' = 'true')` to handle schema changes
- Objective 3: Review of Managed vs External Tables

### Set Up the Lab and Confirm our default catalog and schema
- `%run ../01_Data_Engineer_Learning_Plan/Lab-Setup/lab-setup-01`
    - Include the argument `run_id` = `02`, `03`, `04`, etc. for multiple environments
    - This will create a catalog.schema called `workspace.data_engineering_labs_00`
    - There will be 4 Parquet files totalling 10,000 records created under `v01/raw/users-historical`
- Use `current_catalog()` and `current_schema()` to confirm the defaults.

In [0]:
%run ../01_Data_Engineer_Learning_Plan/Lab-Setup/lab-setup-01 

In [0]:
%sql
SELECT current_catalog(), current_schema()

### Explore the Data Files
1. We will create a table from Parquet files stored in our volume
`/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/`

2. We can use the `dbutils.fs.ls` function to view the files in our volume.

3. We can also query the parquet files by path to quickly preview the files in table form.
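
As a complement to step 2, here is a minimal sketch that lists the Parquet files with the standard Python file API instead of `dbutils`. It assumes the volume path is reachable as a regular directory from the compute running the notebook and that the data files end in `.parquet`. The helper name `list_parquet_files` is hypothetical.

```python
from pathlib import Path


def list_parquet_files(directory):
    """Return (file name, size in bytes) for each .parquet file in `directory`, sorted by name."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix == ".parquet"
    )


# Hypothetical usage, assuming the volume is visible as a local directory:
# list_parquet_files("/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/")
```

Filtering on the `.parquet` suffix skips any marker or metadata files that may sit next to the data files.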


In [0]:
path = "/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/"
files = dbutils.fs.ls(path)
display([(f.name, f.size) for f in files])

In [0]:
%sql

SELECT * 
FROM parquet.`/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/` -- use backticks for filepath

### Batch Ingestion using CTAS with the `read_files()` Function
- Note: A `_rescued_data` column is automatically included to capture any data that does not match the inferred schema.
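
To make the idea of a rescued-data column concrete, the following is a toy model in plain Python. It is not how `read_files()` is implemented. It only illustrates the principle that values which do not fit the expected schema are set aside rather than dropped. The function `split_record` and the type mapping are assumptions made for this sketch.

```python
import json


def split_record(record, schema):
    """Split one raw record into schema-conforming columns plus a rescued-data JSON string.

    `schema` maps column name -> expected Python type. Fields that are absent from
    the schema, or whose value has the wrong type, are moved to `_rescued_data`.
    """
    row = {name: None for name in schema}
    rescued = {}
    for name, value in record.items():
        if name in schema and (value is None or isinstance(value, schema[name])):
            row[name] = value
        else:
            rescued[name] = value
    # Keep the column empty when nothing had to be rescued.
    row["_rescued_data"] = json.dumps(rescued, sort_keys=True) if rescued else None
    return row


# Hypothetical schema and record, for illustration only.
schema = {"user_id": int, "user_first_touch_timestamp": int}
record = {"user_id": 1, "user_first_touch_timestamp": "oops", "email": "a@example.com"}
print(split_record(record, schema))
```

In this sketch, the mistyped timestamp and the unexpected `email` field both end up in `_rescued_data`, while the timestamp column itself is left as `None`.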

In [0]:
%sql

-- READ files
SELECT *
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/',
  format => 'parquet'
)
LIMIT 10;


- We will use a CTAS statement to create the table `historical_users_bronze_ctas_rf`
- Table type is Delta by default

In [0]:
%sql

-- Drop table 
DROP TABLE IF EXISTS historical_users_bronze_ctas_rf;

-- Create Delta Table
CREATE TABLE historical_users_bronze_ctas_rf AS
SELECT *
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/',
  format => 'parquet'
);

--Preview
SELECT * 
FROM historical_users_bronze_ctas_rf
LIMIT 10;

We can run `DESCRIBE TABLE EXTENDED` to view column names, data types, and additional table metadata.
- Note that:
    - the table was created in our catalog `workspace`
    - schema = `data_engineering_labs_00`
    - Table Type = `MANAGED`


In [0]:
%sql 
DESCRIBE TABLE EXTENDED historical_users_bronze_ctas_rf

### Managed vs External tables
1. Managed
  - Databricks manages both data and metadata
  - Data is stored in Databricks-managed storage
  - Dropping the table also deletes the data
2. External
  - Databricks only manages the table metadata
  - Dropping the table does not delete the data
  - Supports multiple formats, including Delta Lake
  - Ideal for sharing data across platforms or using external data

### Batch Python Ingestion 
- This code uses Python as an alternative to SQL to ingest the files

In [0]:
file_path = "/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/"

# 1. Read the Parquet files from the volume into a Spark DataFrame
df = spark.read\
    .format("parquet")\
    .load(file_path)

# 2. Write df to a delta table
df.write.mode("overwrite").saveAsTable("workspace.data_engineering_labs_00.historical_users_bronze_python")

# 3. read and display
users_bronze_table = spark.table("workspace.data_engineering_labs_00.historical_users_bronze_python")
display(users_bronze_table)

### Incremental Data Ingestion with `COPY INTO`
- Existing files are tracked and will be skipped
  - Useful when we need to load data into an existing Delta table
- Note that going forward, `COPY INTO` is considered legacy and Auto Loader is recommended instead for incremental ingestion.
- We will use the same set of files to create our Bronze table again.
- There are 2 examples:
  1. Create Table with Schema then handle Common Schema Mismatch Error
  2. Create Table without Schema then Preemptively Handle Schema Evolution
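
The file-tracking behaviour described above can be pictured with a small toy model. This is a conceptual sketch in plain Python, not the actual `COPY INTO` implementation. The class `ToyCopyInto` and the file names are hypothetical.

```python
class ToyCopyInto:
    """Toy model of an idempotent loader: each source file is ingested at most once."""

    def __init__(self):
        self.loaded_files = set()  # names of files already ingested
        self.rows = []             # rows in the target "table"

    def copy_into(self, files):
        """`files` maps file name -> list of rows. Returns the number of rows inserted."""
        inserted = 0
        for name, rows in files.items():
            if name in self.loaded_files:
                continue  # already loaded on an earlier run: skip
            self.rows.extend(rows)
            self.loaded_files.add(name)
            inserted += len(rows)
        return inserted


table = ToyCopyInto()
source = {"part-0.parquet": [1, 2, 3], "part-1.parquet": [4, 5]}
first = table.copy_into(source)    # loads both files
second = table.copy_into(source)   # same files again: nothing new to load
source["part-2.parquet"] = [6]
third = table.copy_into(source)    # only the newly arrived file is loaded
print(first, second, third)        # 5 0 1
```

Because the loader remembers which files it has seen, re-running it is safe: only files that arrived since the last run contribute rows.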

#### Example 1: Create Table with Schema then handle Common Schema Mismatch Error

- An empty table `historical_users_bronze_ci` is created with a defined schema of 2 columns:
  - `user_id`
  - `user_first_touch_timestamp`

- However, the Parquet files have 3 columns:
  - `user_id`
  - `user_first_touch_timestamp`
  - `email`

The difference in schema causes the first `COPY INTO` below to fail.
- We fix the error by adding `COPY_OPTIONS ('mergeSchema' = 'true')`
- This allows the table schema to evolve based on the incoming data
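
To illustrate what merging schemas means here, the sketch below models a table as a list of column names and a list of dict rows. It is a simplified stand-in for the behaviour described above, not Delta Lake's actual logic. The functions `merge_schema` and `append_rows` are hypothetical names, and the sample values are made up.

```python
def merge_schema(table_columns, incoming_columns):
    """Return the table columns followed by any incoming columns not already present."""
    merged = list(table_columns)
    for col in incoming_columns:
        if col not in merged:
            merged.append(col)
    return merged


def append_rows(table_columns, table_rows, incoming_rows, merge=False):
    """Append dict rows to a toy table; raise on unknown columns unless merge=True."""
    incoming_columns = []
    for row in incoming_rows:
        for col in row:
            if col not in incoming_columns:
                incoming_columns.append(col)
    extra = [c for c in incoming_columns if c not in table_columns]
    if extra and not merge:
        raise ValueError(f"schema mismatch, unexpected columns: {extra}")
    columns = merge_schema(table_columns, incoming_columns)
    # Rows that lack a newly added column get None for it.
    rows = [{c: r.get(c) for c in columns} for r in list(table_rows) + list(incoming_rows)]
    return columns, rows


cols = ["user_id", "user_first_touch_timestamp"]
incoming = [{"user_id": 1, "user_first_touch_timestamp": 1600000000, "email": "a@example.com"}]
try:
    append_rows(cols, [], incoming)            # table has 2 columns, data has 3
except ValueError as err:
    print(err)
cols, rows = append_rows(cols, [], incoming, merge=True)
print(cols)
```

With `merge=True`, the extra `email` column is appended to the table's columns instead of causing an error, which mirrors the role `mergeSchema` plays in the notebook.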



In [0]:
%sql

-- DROP Table 
DROP TABLE IF EXISTS historical_users_bronze_ci;

-- Create empty table with 2 columns only
CREATE TABLE historical_users_bronze_ci (
  user_id LONG,
  user_first_touch_timestamp BIGINT
);

-- Use COPY INTO to populate the Delta table (fails here: the files have an extra `email` column)
COPY INTO historical_users_bronze_ci
FROM '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/'
FILEFORMAT = PARQUET;


In [0]:
%sql

-- USE COPY INTO to populate Delta table with mergeSchema = True
COPY INTO historical_users_bronze_ci
FROM '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true'); -- merges schema of each file

#### Example 2: Create Table without Schema then Preemptively Handle Schema Evolution
- We can also create a table without a schema, then add `COPY_OPTIONS ('mergeSchema' = 'true')` as before
- This enables schema evolution for the table

In [0]:
%sql
-- DROP Table 
DROP TABLE IF EXISTS historical_users_bronze_ci_no_schema;

-- Create empty table without a schema
CREATE TABLE historical_users_bronze_ci_no_schema;

-- USE COPY INTO to populate Delta table
COPY INTO historical_users_bronze_ci_no_schema
FROM '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');



- Note that because `COPY INTO` loads incrementally, the next run will create no new rows

In [0]:
%sql
-- USE COPY INTO to populate Delta table
COPY INTO historical_users_bronze_ci_no_schema
FROM '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');
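
Since the same `COPY INTO` statement is repeated with only the table name changing, one option is to generate it from Python. The helper below is a hypothetical convenience and is not part of the lab. It only assembles the statement text in the form used in this notebook and does no validation or escaping of its inputs.

```python
def build_copy_into(table, source_path, file_format="PARQUET", merge_schema=False):
    """Build a COPY INTO statement in the same form as the cells in this notebook."""
    lines = [
        f"COPY INTO {table}",
        f"FROM '{source_path}'",
        f"FILEFORMAT = {file_format}",
    ]
    if merge_schema:
        lines.append("COPY_OPTIONS ('mergeSchema' = 'true')")
    return "\n".join(lines)


statement = build_copy_into(
    "historical_users_bronze_ci_no_schema",
    "/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/",
    merge_schema=True,
)
print(statement)
```

In a notebook, the resulting string could be passed to `spark.sql(statement)`, assuming a Spark session named `spark` is available, as it is in the Python cell earlier in this lab.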
