# AWS Athena Integration: Symlink Manifest

## Overview
In the previous notebooks, we loaded data into the Landing, Staging, and Data Warehouse layers using Delta Lake. Delta Lake tracks table state in a transaction log, so a table folder typically holds Parquet files that belong to several different versions of the table.

**The Challenge:**
Query engines such as AWS Athena (Presto) treat an external table location as a plain folder and read every file in it. Pointed directly at a Delta Lake folder, they read the Parquet files of **all versions** of the table (including files that were logically deleted or superseded), which results in incorrect or duplicated data.

**The Solution: Symlink Manifest**
To allow Athena to read Delta tables correctly, we generate a **Symlink Manifest**: a text file (one per partition for partitioned tables), written to the table's `_symlink_format_manifest` directory, that lists *only* the Parquet files belonging to the current snapshot of the table.
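
To make the idea concrete, here is a minimal, simplified sketch of how the set of "current" files can be derived from a transaction log. The `actions` list and file names are made up for illustration, and a real Delta log stores JSON commit files with many more fields. In practice the manifest is produced by Delta Lake itself (for example via `DeltaTable.generate("symlink_format_manifest")`), not by hand.

```python
def current_snapshot_files(actions):
    """Replay add/remove actions in commit order and return the live files."""
    live = set()
    for op, path in actions:
        if op == "add":
            live.add(path)
        elif op == "remove":
            live.discard(path)
        else:
            raise ValueError(f"unknown action: {op}")
    return sorted(live)


# Hypothetical log: three commits against one table folder
actions = [
    ("add", "part-0000.parquet"),     # version 0: initial load
    ("add", "part-0001.parquet"),     # version 1: append
    ("remove", "part-0000.parquet"),  # version 2: part-0000 is rewritten...
    ("add", "part-0002.parquet"),     # ...into part-0002
]

# What a folder-scanning engine would read vs. what the manifest lists
all_files_in_folder = sorted({path for _, path in actions})
manifest_files = current_snapshot_files(actions)

print("Folder listing:", all_files_in_folder)
print("Manifest lists:", manifest_files)
```

The stale file `part-0000.parquet` is still physically present in the folder, but it is absent from the manifest, which is why Athena must read through the manifest and not scan the folder.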

## Objective
In this notebook, we will:
1.  Define the SQL DDLs required to create tables in Athena that point to these manifests.
2.  Use Python (`boto3`) to automate the creation of the Databases and Tables in the AWS Glue Catalog.

In [None]:
# Import necessary libraries
import boto3
import time

# Configuration
# Ensure you have your AWS Credentials configured in your environment 
# (e.g., ~/.aws/credentials or via environment variables)
aws_region = "ap-south-1" # Change to your region
athena_output_loc = "s3://warehouse/target/athena_output/" # Created in video

# Initialize Athena Client
athena_client = boto3.client('athena', region_name=aws_region)

print(f"Athena Client Initialized for region: {aws_region}")
print(f"Query Output Location: {athena_output_loc}")

## 1. SQL DDL Definitions
Below are the SQL commands derived from the video. Note the specific configuration for Delta Lake compatibility:
*   **Row Format:** `ParquetHiveSerDe`
*   **Input Format:** `SymlinkTextInputFormat`
*   **Location:** Points specifically to the `_symlink_format_manifest` directory inside the table folder.
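
Because the tables differ only in name, columns, and location, the same DDL could also be assembled from a small template. The helper below is a hypothetical sketch (the name `build_symlink_ddl` and its parameters are not from the video). It only builds a string and makes no AWS calls.

```python
SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
INPUT_FORMAT = "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"
OUTPUT_FORMAT = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"


def build_symlink_ddl(table, columns, table_path):
    """Build a CREATE EXTERNAL TABLE statement that reads via the symlink manifest.

    table      -- qualified table name, e.g. "edw_ld.dim_date_ld"
    columns    -- list of (column_name, hive_type) pairs
    table_path -- S3 path of the Delta table folder (without the manifest suffix)
    """
    cols = ",\n".join(f"    `{name}` {dtype}" for name, dtype in columns)
    # The LOCATION must point at the manifest directory, not the table root
    location = table_path.rstrip("/") + "/_symlink_format_manifest/"
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        f"ROW FORMAT SERDE '{SERDE}'\n"
        f"STORED AS INPUTFORMAT '{INPUT_FORMAT}'\n"
        f"OUTPUTFORMAT '{OUTPUT_FORMAT}'\n"
        f"LOCATION '{location}'"
    )


# Shortened column list, for illustration only
example_ddl = build_symlink_ddl(
    "edw_ld.dim_date_ld",
    [("date", "string"), ("insert_dt", "timestamp")],
    "s3://warehouse/landing/dim_date_ld",
)
print(example_ddl)
```

This notebook keeps the DDLs written out in full so they can be compared line by line with the video. A template like this mainly pays off once there are many tables.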

In [None]:
# 1. Database Creation DDLs
create_db_sqls = [
    "CREATE DATABASE IF NOT EXISTS edw_ld",
    "CREATE DATABASE IF NOT EXISTS edw_stg",
    "CREATE DATABASE IF NOT EXISTS edw"
]

# 2. Table Creation DDLs

# Landing Table
ddl_landing = """
CREATE EXTERNAL TABLE IF NOT EXISTS edw_ld.dim_date_ld (
    `date` string,
    `day` string,
    `month` string,
    `year` string,
    `day_of_week` string,
    `insert_dt` timestamp,
    `rundate` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://warehouse/landing/dim_date_ld/_symlink_format_manifest/'
"""

# Staging Table
ddl_staging = """
CREATE EXTERNAL TABLE IF NOT EXISTS edw_stg.dim_date_stg (
    `date` date,
    `day` int,
    `month` int,
    `year` int,
    `day_of_week` string,
    `insert_dt` timestamp,
    `update_dt` timestamp,
    `rundate` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://warehouse/staging/dim_date_stg/_symlink_format_manifest/'
"""

# Dimension (DW) Table
ddl_dw = """
CREATE EXTERNAL TABLE IF NOT EXISTS edw.dim_date (
    `row_wid` bigint,
    `date` date,
    `day` int,
    `month` int,
    `year` int,
    `day_of_week` string,
    `rundate` string,
    `insert_dt` timestamp,
    `update_dt` timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://warehouse/edw/dim_date/_symlink_format_manifest/'
"""

table_sqls = [ddl_landing, ddl_staging, ddl_dw]

In [None]:
# Helper function to execute Athena queries
def run_athena_query(query, database, output_location):
    """Run a query in Athena, wait for it to finish, and return its final state (None on error)."""
    try:
        response = athena_client.start_query_execution(
            QueryString=query,
            QueryExecutionContext={'Database': database},
            ResultConfiguration={'OutputLocation': output_location}
        )
        query_execution_id = response['QueryExecutionId']

        # Poll until the query reaches a terminal state (SUCCEEDED, FAILED, or CANCELLED)
        while True:
            res = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
            status = res['QueryExecution']['Status']
            if status['State'] not in ('QUEUED', 'RUNNING'):
                break
            time.sleep(1)

        if status['State'] == 'SUCCEEDED':
            # strip() drops the leading newline of the triple-quoted DDL strings
            print(f"Query Succeeded: {query.strip()[:50]}...")
        else:
            # StateChangeReason is not always present (e.g. for cancelled queries)
            reason = status.get('StateChangeReason', 'no reason given')
            print(f"Query {status['State']}: {reason}")
        return status['State']

    except Exception as e:
        print(f"Error executing query: {e}")
        return None

# 1. Create Databases
print("--- Creating Databases ---")
for sql in create_db_sqls:
    run_athena_query(sql, 'default', athena_output_loc)

# 2. Create Tables
print("\n--- Creating Tables ---")
# Note: We pass 'default' as context, but the table names in DDL have schema prefixes (e.g., edw_ld.table)
for sql in table_sqls:
    run_athena_query(sql, 'default', athena_output_loc)
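
The loop in `run_athena_query` polls once per second with no upper bound. One possible variant adds a timeout and a growing delay between polls. In the sketch below, `wait_for_query`, its parameters, and `FakeAthenaClient` are hypothetical. The fake client stands in for `boto3` only so that the example runs without AWS access.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client, query_execution_id, timeout_s=60.0,
                   first_delay_s=0.5, max_delay_s=5.0, sleep=time.sleep):
    """Poll get_query_execution until a terminal state or until timeout_s is used up."""
    delay = first_delay_s
    waited = 0.0
    while True:
        res = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = res["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"query {query_execution_id} still {state} after {waited:.1f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)  # back off, capped at max_delay_s


class FakeAthenaClient:
    """Test double that returns a scripted sequence of query states."""

    def __init__(self, states):
        self._states = iter(states)

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": next(self._states)}}}


fake = FakeAthenaClient(["QUEUED", "RUNNING", "SUCCEEDED"])
final_state = wait_for_query(fake, "example-id", sleep=lambda seconds: None)
print(final_state)
```

With a real client, the call would be `wait_for_query(athena_client, query_execution_id)`. Passing `sleep` as a parameter is a design choice that keeps the function testable without real waiting.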

## 2. Verification
Now that the tables are created, we can query the Data Warehouse table (`edw.dim_date`) to confirm that Athena can read the data correctly via the manifest.

In [None]:
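# Query the final Dimension table
verify_sql = "SELECT * FROM edw.dim_date LIMIT 10"

print("\n--- Verifying Data Access ---")
try:
    # Run the select query
    response = athena_client.start_query_execution(
        QueryString=verify_sql,
        QueryExecutionContext={'Database': 'edw'},
        ResultConfiguration={'OutputLocation': athena_output_loc}
    )
    query_id = response['QueryExecutionId']

    # Wait for completion: poll instead of sleeping for a fixed time,
    # because get_query_results raises an error while the query is still running
    state = 'QUEUED'
    while state in ('QUEUED', 'RUNNING'):
        time.sleep(1)
        status = athena_client.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']
        state = status['State']

    if state != 'SUCCEEDED':
        raise RuntimeError(f"Query {state}: {status.get('StateChangeReason', 'no reason given')}")

    # Get results (the first row holds the column headers)
    results = athena_client.get_query_results(QueryExecutionId=query_id)

    # Simple print of result rows
    rows = results['ResultSet']['Rows']
    for row in rows:
        data = [col.get('VarCharValue', 'NULL') for col in row['Data']]
        print("\t".join(data))

except Exception as e:
    print(f"Verification failed. Ensure AWS credentials are set and tables were created. Error: {e}")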
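
The tab-separated print above treats the header row like any other row. As a possible refinement, the response from `get_query_results` can be reshaped into a list of dictionaries keyed by column name. The function name `rows_to_dicts` and the sample response below are illustrative. The sample only mimics the nested `ResultSet`, `Rows`, `Data`, `VarCharValue` shape used in the cell above, and its values are made up.

```python
def rows_to_dicts(results):
    """Turn an Athena get_query_results response into a list of {column: value} dicts."""
    rows = results["ResultSet"]["Rows"]
    if not rows:
        return []
    # For SELECT queries the first row carries the column names
    header = [col.get("VarCharValue") for col in rows[0]["Data"]]
    records = []
    for row in rows[1:]:
        # A NULL cell has no 'VarCharValue' key, so .get() yields None
        values = [col.get("VarCharValue") for col in row["Data"]]
        records.append(dict(zip(header, values)))
    return records


# Made-up response with two columns and two data rows
sample_results = {
    "ResultSet": {
        "Rows": [
            {"Data": [{"VarCharValue": "row_wid"}, {"VarCharValue": "day_of_week"}]},
            {"Data": [{"VarCharValue": "1"}, {"VarCharValue": "Monday"}]},
            {"Data": [{"VarCharValue": "2"}, {}]},  # empty dict = NULL value
        ]
    }
}

records = rows_to_dicts(sample_results)
print(records)
```

All values arrive as strings, so numeric or date columns would still need to be converted before further processing.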