### Lesson
- In this lesson, we ingest CSV files into Delta Lake using the CTAS (CREATE TABLE AS SELECT) pattern with the read_files() function, and explore the rescued data column.

### Objectives
1. Ingest CSV files using CTAS statement with the read_files() function.
2. Define and apply an explicit schema with read_files() to ensure consistent and reliable data ingestion.
3. Handle and inspect rescued data that does not conform to the defined schema.


### Set up the lab

In [0]:
%run ../01_Data_Engineer_Learning_Plan/Lab-Setup/lab-setup-04

### Explore the files in our volume
- There should be 4 CSV files.

In [0]:
LIST '/Volumes/workspace/data_engineering_labs_00/v01/raw/sales-csv/'

### Explore the CSV files
- Query the CSV files with `read_files()` and a few selected options.



In [0]:
SELECT *
FROM read_files(
  "/Volumes/workspace/data_engineering_labs_00/v01/raw/sales-csv/",
  format => "csv",
  sep => ",",
  header => true
)
  LIMIT 5

### Use CTAS to create a Delta table
- Include file metadata columns and an ingestion timestamp.

In [0]:
-- Drop the table if it exists (same fully qualified name as the CREATE below)
DROP TABLE IF EXISTS workspace.data_engineering_labs_00.sales_bronze;

-- CREATE delta table
CREATE TABLE workspace.data_engineering_labs_00.sales_bronze AS
SELECT *,
_metadata.file_modification_time AS file_modification_time,
_metadata.file_name AS source_file,
current_timestamp() AS ingestion_time
FROM read_files(
  "/Volumes/workspace/data_engineering_labs_00/v01/raw/sales-csv/",
  format => "csv",
  sep => ",",
  header => true
);

-- Display
SELECT *
FROM sales_bronze

### Explore the sales_bronze table
- Use DESCRIBE TABLE EXTENDED.
- Notice that the schema is inferred when one is not provided.
- You can see this in the data type listed for each column.

In [0]:
DESCRIBE TABLE EXTENDED sales_bronze

### Python Equivalent of CTAS

In [0]:
%python
df = (spark
      .read
      .option("header", True)
      .option("sep", ",")
      .option("inferSchema", True)  # spark.read treats every column as string unless asked to infer
      .option("rescuedDataColumn", "_rescued_data")
      .csv("/Volumes/workspace/data_engineering_labs_00/v01/raw/sales-csv/")
)

# Display
df.display()

# Write table, if you wish
# df.write.mode("overwrite").saveAsTable("users_bronze_table")

### Explore malformed file 1
- Look at the malformed file in `/Volumes/workspace/data_engineering_labs_00/v01/ops/csv_demo_files/malformed_example_1_data.csv`.
- Notice that the first record has transactions_timestamp = `aaa`, which is not a valid timestamp.
- Read the whole file and check the column type: because of the bad value, the column comes back as a string instead of a timestamp.

In [0]:
%python
spark.sql('''
          SELECT *
          FROM text.`/Volumes/workspace/data_engineering_labs_00/v01/ops/csv_demo_files/malformed_example_1_data.csv`
          ''').display()

In [0]:
-- Note the string type after reading in with read_files()
SELECT * 
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/ops/csv_demo_files/malformed_example_1_data.csv',
  format => "csv",
  sep => ",",
  header => true
);
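To see why one bad value changes the column type, here is a simplified pure-Python illustration of type inference. It is not how Spark implements inference; the helper `infer_type` and the sample values are hypothetical and only mimic the idea that a column gets the strictest type that every value satisfies.

```python
from datetime import datetime


def infer_type(values):
    """Return the strictest type name that every non-empty value parses as."""
    def parses(cast, value):
        try:
            cast(value)
            return True
        except ValueError:
            return False

    # Candidate types, tried from strictest to loosest.
    candidates = [
        ("INT", int),
        ("DOUBLE", float),
        ("TIMESTAMP", datetime.fromisoformat),
    ]
    non_empty = [v for v in values if v != ""]
    for name, cast in candidates:
        if non_empty and all(parses(cast, v) for v in non_empty):
            return name
    return "STRING"  # fallback when no stricter type fits every value


# Hypothetical column values, not taken from the lab files.
clean = ["2024-01-05 10:15:00", "2024-01-06 11:30:00"]
dirty = ["aaa", "2024-01-06 11:30:00"]

print(infer_type(clean))
print(infer_type(dirty))
```

A single unparseable value is enough to push the whole column down to the string fallback, which matches what the query above shows for the malformed file.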

### Handling type mismatch: Defining the schema
- Declare transactions_timestamp as DATE in an explicit schema.
- The value `aaa` cannot be parsed as a DATE, so it is captured in the `_rescued_data` column.


In [0]:
SELECT * 
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/ops/csv_demo_files/malformed_example_1_data.csv',
  format => "csv",
  sep => ",",
  schema => '''
  order_id INT,
  email STRING,
  transactions_timestamp DATE,
  total_item_quantity INT,
  purchase_revenue_in_usd DOUBLE,
  unique_items INT,
  items STRING
  ''',
  header => true,
  rescuedDataColumn => '_rescued_data'
);
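The rescue behaviour can be sketched in plain Python. This is only an illustration under stated assumptions, not the Databricks implementation: `SCHEMA`, `parse_row`, and the sample rows are hypothetical, and the sketch assumes that a column which fails to cast becomes NULL while its original text is kept as JSON in `_rescued_data`.

```python
import json
from datetime import date

# Hypothetical two-column slice of the schema used above.
SCHEMA = {"order_id": int, "transactions_timestamp": date.fromisoformat}


def parse_row(raw):
    """Cast each field; keep values that fail to cast in _rescued_data."""
    parsed, rescued = {}, {}
    for column, cast in SCHEMA.items():
        try:
            parsed[column] = cast(raw[column])
        except ValueError:
            parsed[column] = None          # assumption: failed column becomes NULL
            rescued[column] = raw[column]  # original text is preserved
    parsed["_rescued_data"] = json.dumps(rescued) if rescued else None
    return parsed


# Hypothetical records, not taken from the lab files.
rows = [
    {"order_id": "1", "transactions_timestamp": "aaa"},         # malformed date
    {"order_id": "2", "transactions_timestamp": "2024-01-06"},  # clean record
]

for row in rows:
    print(parse_row(row))
```

The point of the pattern is that the malformed record is not dropped and does not force the column to a string type: the typed column stays typed, and the bad value remains available for inspection.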

### Handling Missing Headers During Ingestion
- Explore the data.
- Note that the header row has only 6 column names, while each record has 7 fields.
- When the file is read into a table, the headers and the record values will be mismatched.

In [0]:
%python
spark.sql('''
          SELECT *
          FROM text.`/Volumes/workspace/data_engineering_labs_00/v01/ops/csv_demo_files/malformed_example_2_data.csv`
          ''').display()

In [0]:

DROP TABLE IF EXISTS malformed_header_bronze;

CREATE OR REPLACE TABLE malformed_header_bronze AS
SELECT * 
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/ops/csv_demo_files/malformed_example_2_data.csv',
  format => "csv",
  sep => ",",
  header => true
);

SELECT * 
FROM malformed_header_bronze
LIMIT 1
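The header mismatch can be reproduced outside Spark with Python's standard `csv` module. The file contents below are hypothetical (the sketch assumes the last column name is the one that is missing), and `_extra_fields` is just an illustrative key, not a Databricks column.

```python
import csv
import io

# Hypothetical file contents: 6 header names, 7 fields in the record.
raw = (
    "order_id,email,transactions_timestamp,total_item_quantity,"
    "purchase_revenue_in_usd,unique_items\n"
    '1,a@example.com,2024-01-05,2,19.99,1,"[item_a]"\n'
)

# DictReader collects values that have no matching header under `restkey`.
reader = csv.DictReader(io.StringIO(raw), restkey="_extra_fields")
record = next(reader)

print(len(reader.fieldnames))   # number of header names
print(record["_extra_fields"])  # values left over without a header
```

Here the seventh value has no header to map to, so it is set aside under its own key, much as a rescued data column would hold it.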

### Note:
- The tutorial expects the mismatched values to land in the `_rescued_data` column, but that does not happen here. Ignore this for now and move on.