### Lesson:
- In this lesson, we will learn how to add metadata columns during ingestion

### Objectives
- Modify columns during data ingestion from cloud storage to the bronze table
- Add the current ingestion timestamp to the bronze table
- Use the _metadata column to extract the file-level metadata

### 01 Setup:
- We will be using the same data from the "01 Data Ingestion with CREATE TABLE AS and COPY INTO" lab
- If the data has not already been created, run `%run ../01_Data_Engineer_Learning_Plan/Lab-Setup/lab-setup-01`

In [0]:
-- View the files in our volume that should contain our parquet files
LIST '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical/'

### 02 Adding Metadata columns to the Bronze Table during Ingestion
- To include the `_metadata` column, we have to explicitly select it in the read query
- Let's also try to:
  1. Convert parquet timestamp to a DATE column
  2. Include input file name 
  3. Include last modification timestamp
  4. Add file ingestion time


In [0]:
-- Read sample data from our parquet files
SELECT * 
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical',
  format => 'parquet'
)
LIMIT 10;

- Use the `from_unixtime()` function to create a readable date column; it expects seconds since the Unix epoch
- `user_first_touch_timestamp` is in microseconds, so divide by 1,000,000 to convert it to seconds first

In [0]:
-- Convert the Unix timestamp (microseconds) to a DATE during ingestion to Bronze
SELECT *,
CAST(from_unixtime(user_first_touch_timestamp/1000000) AS DATE) as first_touch_date
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical',
  format => 'parquet'
)
LIMIT 10;
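
The unit conversion above can be checked outside Spark with a minimal plain-Python sketch. The helper name `micros_to_date` and the sample value are hypothetical and not part of the lab data. The sketch uses UTC, whereas `from_unixtime()` uses the Spark session time zone, so dates near midnight may differ.

```python
from datetime import date, datetime, timezone

MICROS_PER_SECOND = 1_000_000  # 1 second = 1,000,000 microseconds


def micros_to_date(micros: int) -> date:
    """Convert microseconds since the Unix epoch to a calendar date in UTC."""
    seconds = micros // MICROS_PER_SECOND  # drop the sub-second part
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


# Hypothetical sample value, not taken from the lab files
sample_micros = 1_600_000_000_000_000
print(micros_to_date(sample_micros))
```

Dividing by the wrong power of ten does not raise an error; it silently produces dates in the wrong year, so a spot check like this one is a cheap safeguard.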


- We can add these metadata columns:
  - `_metadata.file_modification_time`
  - `_metadata.file_name`
  - `current_timestamp()`

In [0]:
SELECT *,
CAST(from_unixtime(user_first_touch_timestamp/1000000) AS DATE) as first_touch_date,
_metadata.file_name as file_name,
_metadata.file_modification_time as file_modification_time,
current_timestamp() as ingestion_time
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical',
  format => 'parquet'
)
LIMIT 10;
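
To make the difference between the file-level fields and the ingestion time concrete, here is a local-filesystem analogy in plain Python. It does not use the Databricks `_metadata` column; the helper `describe_file` and the file name `part-00000.parquet` are hypothetical, and the file holds placeholder bytes rather than real parquet data.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path: str) -> dict:
    """Collect file-level facts analogous to the metadata columns selected above."""
    stat = os.stat(path)
    return {
        # Property of the file itself: same value on every ingestion run
        "file_name": os.path.basename(path),
        "file_size": stat.st_size,
        "file_modification_time": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
        # Property of this run: changes every time the file is ingested
        "ingestion_time": datetime.now(timezone.utc),
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "part-00000.parquet")  # hypothetical name
    with open(sample_path, "wb") as handle:
        handle.write(b"placeholder bytes, not real parquet data")
    record = describe_file(sample_path)

print(record["file_name"], record["file_size"])
```

The file name and modification time describe the source file, while the ingestion time describes the load itself. Keeping both in the bronze table lets you trace a row back to its file and to the run that loaded it.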

### Putting it all together: create our Bronze Delta Table

In [0]:
-- DROP Table and recreate
DROP TABLE IF EXISTS historical_users_bronze;

-- Create the table and populate it from the query (CREATE TABLE AS SELECT)
CREATE TABLE historical_users_bronze AS 

SELECT *,
CAST(from_unixtime(user_first_touch_timestamp/1000000) AS DATE) as first_touch_date,
_metadata.file_name as file_name,
_metadata.file_modification_time as file_modification_time,
current_timestamp() as ingestion_time
FROM read_files(
  '/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical',
  format => 'parquet'
);

-- View final bronze table
SELECT * FROM historical_users_bronze LIMIT 10

### Optional Exploration
- Count how many rows came from each parquet file

In [0]:
SELECT
file_name as source_file, count(*) as total
FROM historical_users_bronze
GROUP BY source_file
ORDER BY source_file DESC
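
The same per-file count can be sketched in plain Python on a few in-memory records, which may help when reasoning about what the query returns. The rows and file names below are hypothetical stand-ins, not values from the bronze table.

```python
from collections import Counter

# Hypothetical rows shaped like the bronze table (only the relevant columns)
rows = [
    {"user_id": "u1", "file_name": "part-00000.parquet"},
    {"user_id": "u2", "file_name": "part-00001.parquet"},
    {"user_id": "u3", "file_name": "part-00000.parquet"},
    {"user_id": "u4", "file_name": "part-00002.parquet"},
]

# Equivalent of GROUP BY file_name with count(*)
counts = Counter(row["file_name"] for row in rows)

# Equivalent of ORDER BY source_file DESC
ordered = sorted(counts.items(), reverse=True)
for source_file, total in ordered:
    print(source_file, total)
```

A file with an unexpectedly low or zero count is often the first sign of a partial or failed load.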

### Python Equivalent

In [0]:
%python
from pyspark.sql.functions import col, from_unixtime, current_timestamp
from pyspark.sql.types import DateType

# Read parquet
df = spark.read.format("parquet").load('/Volumes/workspace/data_engineering_labs_00/v01/raw/users-historical')

## Add Metadata columns
df_with_metadata = df.withColumn("first_touch_date", from_unixtime(col("user_first_touch_timestamp")/1000000).cast(DateType()))\
.withColumn("file_modification_time", col("_metadata.file_modification_time"))\
.withColumn("source_file", col("_metadata.file_name"))\
.withColumn("ingestion_time", current_timestamp())

# Save as delta table
df_with_metadata.write.format("delta").mode("overwrite").saveAsTable("workspace.data_engineering_labs_00.users_historical_bronze_python_metadata")

# Read and display the table
users_historical_bronze_python_metadata = spark.table("workspace.data_engineering_labs_00.users_historical_bronze_python_metadata")
display(users_historical_bronze_python_metadata)