# Ingest Parquet Files to Delta Tables

This notebook reads three Parquet files (`balance-sheet.parquet`, `income-statement.parquet`, and `sec-10k-chunked.parquet`) into Spark DataFrames and writes each one to Unity Catalog as a Delta table.

In [None]:
# Configuration
CATALOG_NAME = "your_catalog_name"  # Replace with your catalog name
SCHEMA_NAME = "your_schema_name"    # Replace with your schema name

# Local data directory (relative to notebook location)
LOCAL_DATA_PATH = "./data"

# Target schema for Delta tables
TARGET_SCHEMA = f"{CATALOG_NAME}.{SCHEMA_NAME}"

print(f"Reading from: {LOCAL_DATA_PATH}")
print(f"Writing to schema: {TARGET_SCHEMA}")
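
Depending on the runtime, Spark may resolve an unqualified path such as `./data` against its default file system rather than the notebook's working directory. A minimal sketch of one workaround, assuming the Parquet files sit on the driver's local file system next to the notebook, is to build an absolute `file:` URI before reading. The helper name `to_file_uri` and the example base directory are hypothetical.

```python
import os
from typing import Optional


def to_file_uri(relative_path: str, base_dir: Optional[str] = None) -> str:
    """Return an absolute file: URI for a path relative to base_dir (default: cwd)."""
    base = base_dir if base_dir is not None else os.getcwd()
    # abspath also normalises segments such as "./" and "../"
    absolute = os.path.abspath(os.path.join(base, relative_path))
    return f"file:{absolute}"


# Hypothetical base directory, used only to show the shape of the result
print(to_file_uri("./data", base_dir="/tmp/project"))
```

If the assumption holds in your workspace, `LOCAL_DATA_PATH = to_file_uri("./data")` could stand in for the relative path above.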

## Create Schema If It Doesn't Exist

In [None]:
# Create schema if it doesn't exist
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {TARGET_SCHEMA}")
print(f"Schema {TARGET_SCHEMA} created or already exists")
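
The statement above interpolates the catalog and schema names straight into SQL, which can fail when a name contains a hyphen or another special character. A small sketch of quoting each part with backticks, the delimiter Spark SQL uses for identifiers, follows. The helper names `quote_identifier` and `qualified_name` and the example names are hypothetical.

```python
def quote_identifier(name: str) -> str:
    """Wrap one identifier in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"


def qualified_name(*parts: str) -> str:
    """Join quoted identifier parts into a dotted name."""
    return ".".join(quote_identifier(part) for part in parts)


# Hypothetical names; the hyphen would break an unquoted identifier
print(qualified_name("my-catalog", "finance_demo"))
# In the notebook:
# spark.sql(f"CREATE SCHEMA IF NOT EXISTS {qualified_name(CATALOG_NAME, SCHEMA_NAME)}")
```

The default placeholder names in this notebook contain only letters and underscores, so quoting matters only if you substitute names with other characters.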

## 1. Ingest Balance Sheet Data

In [None]:
# Read balance sheet parquet file
balance_sheet_path = f"{LOCAL_DATA_PATH}/balance-sheet.parquet"
balance_sheet_df = spark.read.parquet(balance_sheet_path)

print("Balance Sheet Data:")
print(f"Rows: {balance_sheet_df.count()}")
print(f"Columns: {len(balance_sheet_df.columns)}")
print("\nSchema:")
balance_sheet_df.printSchema()
print("\nSample data:")
balance_sheet_df.show(5)

In [None]:
# Write balance sheet to Delta table
balance_sheet_table = f"{TARGET_SCHEMA}.balance_sheet"

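# overwriteSchema lets a full overwrite replace the table schema as well as the data;
# mergeSchema would only merge new columns into the existing schema.
(
    balance_sheet_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(balance_sheet_table)
)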

print(f"✅ Balance sheet data written to {balance_sheet_table}")

## 2. Ingest Income Statement Data

In [None]:
# Read income statement parquet file
income_statement_path = f"{LOCAL_DATA_PATH}/income-statement.parquet"
income_statement_df = spark.read.parquet(income_statement_path)

print("Income Statement Data:")
print(f"Rows: {income_statement_df.count()}")
print(f"Columns: {len(income_statement_df.columns)}")
print("\nSchema:")
income_statement_df.printSchema()
print("\nSample data:")
income_statement_df.show(5)

In [None]:
# Write income statement to Delta table
income_statement_table = f"{TARGET_SCHEMA}.income_statement"

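# overwriteSchema lets a full overwrite replace the table schema as well as the data;
# mergeSchema would only merge new columns into the existing schema.
(
    income_statement_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(income_statement_table)
)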

print(f"✅ Income statement data written to {income_statement_table}")

## 3. Ingest SEC 10K Chunked Data

In [None]:
# Read SEC 10K chunked parquet file
sec_10k_path = f"{LOCAL_DATA_PATH}/sec-10k-chunked.parquet"
sec_10k_df = spark.read.parquet(sec_10k_path)

print("SEC 10K Chunked Data:")
print(f"Rows: {sec_10k_df.count()}")
print(f"Columns: {len(sec_10k_df.columns)}")
print("\nSchema:")
sec_10k_df.printSchema()
print("\nSample data:")
sec_10k_df.show(5, truncate=False)

In [None]:
# Write SEC 10K chunked data to Delta table
sec_10k_table = f"{TARGET_SCHEMA}.sec_10k_chunked"

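# overwriteSchema lets a full overwrite replace the table schema as well as the data;
# mergeSchema would only merge new columns into the existing schema.
(
    sec_10k_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(sec_10k_table)
)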

print(f"✅ SEC 10K chunked data written to {sec_10k_table}")
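
The three ingest steps above share one read-then-write pattern, so they could also be driven from a single mapping. The sketch below assumes the same file and table names as the cells above. The names `SOURCES`, `build_ingest_plan`, and `ingest_all` are hypothetical, and `spark` is passed in explicitly so the planning step can be checked without a cluster.

```python
# File stem -> table name, taken from the cells above
SOURCES = {
    "balance-sheet": "balance_sheet",
    "income-statement": "income_statement",
    "sec-10k-chunked": "sec_10k_chunked",
}


def build_ingest_plan(data_path: str, target_schema: str) -> list:
    """Return (parquet_path, table_name) pairs for every source file."""
    return [
        (f"{data_path}/{stem}.parquet", f"{target_schema}.{table}")
        for stem, table in SOURCES.items()
    ]


def ingest_all(spark, data_path: str, target_schema: str) -> None:
    """Read each Parquet file and overwrite the matching Delta table."""
    for path, table in build_ingest_plan(data_path, target_schema):
        df = spark.read.parquet(path)
        # Schema-evolution options from the cells above are left out for brevity
        df.write.format("delta").mode("overwrite").saveAsTable(table)
        print(f"Wrote {path} to {table}")


# Inspect the plan without touching Spark
for path, table in build_ingest_plan("./data", "your_catalog_name.your_schema_name"):
    print(path, "->", table)
```

In the notebook, `ingest_all(spark, LOCAL_DATA_PATH, TARGET_SCHEMA)` would take the place of the three write cells, at the cost of the per-table schema and sample output shown above.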

## 4. Verify Delta Table Creation

In [None]:
# List all tables in the schema
tables_df = spark.sql(f"SHOW TABLES IN {TARGET_SCHEMA}")
print(f"Tables in {TARGET_SCHEMA}:")
tables_df.show()

# Get table details
table_names = [
    f"{TARGET_SCHEMA}.balance_sheet",
    f"{TARGET_SCHEMA}.income_statement", 
    f"{TARGET_SCHEMA}.sec_10k_chunked"
]

print("\nTable Details:")
for table in table_names:
    try:
        count = spark.table(table).count()
        print(f"  {table}: {count:,} rows")
    except Exception as e:
        print(f"  {table}: Error - {e}")
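
Row counts alone do not show whether a write was complete. One possible extension is to compare each table's count with the count of the DataFrame it was written from. The sketch below covers only the comparison step: the helper name `find_count_mismatches` is hypothetical, and the numbers are made up purely for illustration.

```python
def find_count_mismatches(expected: dict, actual: dict) -> dict:
    """Return {table: (expected_rows, actual_rows)} for tables whose counts differ.

    A table missing from `actual` is reported with None as its actual count.
    """
    return {
        table: (rows, actual.get(table))
        for table, rows in expected.items()
        if actual.get(table) != rows
    }


# Made-up counts, for illustration only
expected = {"balance_sheet": 120, "income_statement": 120, "sec_10k_chunked": 5000}
actual = {"balance_sheet": 120, "income_statement": 118}
print(find_count_mismatches(expected, actual))
```

In the notebook, `expected` could be filled from the source DataFrames (for example `balance_sheet_df.count()`) and `actual` from `spark.table(table).count()`. An empty result would mean every table matches its source.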

## 5. Test Queries on Delta Tables

In [None]:
# Test query on balance sheet
print("Sample Balance Sheet Query:")
spark.sql(f"""
    SELECT * 
    FROM {TARGET_SCHEMA}.balance_sheet 
    LIMIT 3
""").show()

In [None]:
# Test query on income statement
print("Sample Income Statement Query:")
spark.sql(f"""
    SELECT * 
    FROM {TARGET_SCHEMA}.income_statement 
    LIMIT 3
""").show()

In [None]:
# Test query on SEC 10K data
print("Sample SEC 10K Query:")
spark.sql(f"""
    SELECT * 
    FROM {TARGET_SCHEMA}.sec_10k_chunked 
    LIMIT 3
""").show(truncate=False)

## Summary

✅ Once every cell above has run without errors, the three Parquet files are available as Delta tables:

1. **Balance Sheet**: `{CATALOG_NAME}.{SCHEMA_NAME}.balance_sheet`
2. **Income Statement**: `{CATALOG_NAME}.{SCHEMA_NAME}.income_statement`
3. **SEC 10K Chunked**: `{CATALOG_NAME}.{SCHEMA_NAME}.sec_10k_chunked`

These tables are now ready to be used for:
- Creating a Genie Space (balance_sheet and income_statement tables)
- Creating a Vector Search Index (sec_10k_chunked table)
- Building Agent Bricks components
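
A Delta Sync index in Databricks Vector Search generally expects Change Data Feed to be enabled on its source table. Treat this as an assumption to check against the current Vector Search documentation. A sketch of building the statement for the chunked table follows; the helper name `enable_cdf_sql` is hypothetical, and the table name uses this notebook's placeholder catalog and schema.

```python
def enable_cdf_sql(table_name: str) -> str:
    """Build an ALTER TABLE statement that turns on Delta Change Data Feed."""
    return (
        f"ALTER TABLE {table_name} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


# Placeholder name; in the notebook this would be sec_10k_table
statement = enable_cdf_sql("your_catalog_name.your_schema_name.sec_10k_chunked")
print(statement)
# In the notebook: spark.sql(statement)
```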