# Analytics & Reporting on AWS Athena

## Overview
In the previous sections we built an end-to-end Data Lakehouse with PySpark and Delta Lake, organized in three layers:
*   **Landing:** Raw Data Ingestion.
*   **Staging:** Cleaning, Deduplication, Transformation.
*   **Data Warehouse:** Star Schema (Facts & Dimensions) with SCD1/SCD2 logic.

Now, we act as Data Analysts. We will use **AWS Athena** (a serverless query engine) to run SQL queries on top of our Delta Lake tables to generate business insights.

## Prerequisites
*   AWS Credentials configured.
*   Athena tables created (via Symlink Manifest in previous steps).
*   `boto3` library installed.

## Reports to Generate
1.  **Total Sales per Store:** Evaluate store performance.
2.  **Top Selling Products:** Identify the 10 products that generate the most revenue.
3.  **Sales by Plan Type:** Understand how revenue splits across customer plan types.

In [None]:
# Import necessary libraries
import boto3
import time
import pandas as pd
import io

# Configuration
aws_region = "ap-south-1" # Update to your region
athena_output_loc = "s3://warehouse/target/athena_output/"
database = "edw" # The Data Warehouse DB created in Athena

# Initialize Client
athena_client = boto3.client('athena', region_name=aws_region)
s3_client = boto3.client('s3', region_name=aws_region)

print(f"Connected to Athena in region: {aws_region}")

In [None]:
# Helper function to run Athena query and return Pandas DataFrame
def run_query(query, database, output_location):
    try:
        # Start Query
        response = athena_client.start_query_execution(
            QueryString=query,
            QueryExecutionContext={'Database': database},
            ResultConfiguration={'OutputLocation': output_location}
        )
        query_id = response['QueryExecutionId']
        
        # Wait for completion
        while True:
            stats = athena_client.get_query_execution(QueryExecutionId=query_id)
            status = stats['QueryExecution']['Status']['State']
            if status in ['SUCCEEDED', 'FAILED', 'CANCELLED']:
                break
            time.sleep(1)
            
        if status == 'SUCCEEDED':
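            # Athena reports the full S3 path of the result file, so use it
            # instead of rebuilding the key from output_location
            result_uri = stats['QueryExecution']['ResultConfiguration']['OutputLocation']
            bucket, key = result_uri.replace("s3://", "", 1).split("/", 1)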
            
            # Read CSV from S3 directly into Pandas
            obj = s3_client.get_object(Bucket=bucket, Key=key)
            df = pd.read_csv(io.BytesIO(obj['Body'].read()))
            return df
        else:
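            # StateChangeReason can be absent (e.g. for some cancelled queries)
            reason = stats['QueryExecution']['Status'].get('StateChangeReason', 'no reason given')
            print(f"Query {status}: {reason}")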
            return None
            
    except Exception as e:
        print(f"Error: {e}")
        return None
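
The polling loop in `run_query` waits indefinitely, so a stuck query would block the notebook. One possible variation, shown here as a minimal sketch, bounds the wait and backs off between polls. The name `wait_for_terminal_state` and its parameters are hypothetical and not part of the course code; it takes any zero-argument callable that returns the current state, which keeps it independent of `boto3`.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def wait_for_terminal_state(get_state, timeout_s=300.0, poll_s=1.0,
                            max_poll_s=10.0, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or timeout_s elapses.

    get_state: zero-argument callable returning the current state string.
    Returns the terminal state; raises TimeoutError if none is reached in time.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"Query still {state} after {waited:.0f}s")
        sleep(poll_s)
        waited += poll_s
        # Double the interval (up to max_poll_s) to reduce API calls
        poll_s = min(poll_s * 2, max_poll_s)

# Possible use inside run_query (assumes athena_client and query_id exist):
# status = wait_for_terminal_state(
#     lambda: athena_client.get_query_execution(QueryExecutionId=query_id)
#     ['QueryExecution']['Status']['State']
# )
```

Passing `sleep` as a parameter is only a convenience for trying the function out without real delays.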

## Report 1: Total Sales per Store
We join `fact_sales` with `dim_store` to aggregate total sales revenue by store name.

In [None]:
query_1 = """
SELECT 
    s.store_name, 
    SUM(f.line_total) as total_sales
FROM fact_sales f
JOIN dim_store s ON f.store_wid = s.row_wid
GROUP BY s.store_name
ORDER BY total_sales DESC
"""

print("Executing Report 1: Total Sales per Store...")
df_sales_store = run_query(query_1, database, athena_output_loc)

if df_sales_store is not None:
    print(df_sales_store)
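
Because the result comes back as a Pandas DataFrame, follow-up calculations can be done locally without another Athena query. As a small sketch, the snippet below adds each store's share of total revenue. The store names and figures are made-up stand-ins for the real query result, and `add_sales_share` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical stand-in for df_sales_store; real values come from Athena
df_sales_store = pd.DataFrame({
    "store_name": ["Store A", "Store B", "Store C"],
    "total_sales": [500.0, 300.0, 200.0],
})

def add_sales_share(df, value_col="total_sales"):
    """Return a copy of df with each row's percentage of the column total."""
    out = df.copy()
    out["pct_of_total"] = (out[value_col] / out[value_col].sum() * 100).round(2)
    return out

report = add_sales_share(df_sales_store)
print(report)
```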

## Report 2: Top Selling Products
We identify which products generate the most revenue.

In [None]:
query_2 = """
SELECT 
    p.product_name, 
    SUM(f.line_total) as total_sales
FROM fact_sales f
JOIN dim_product p ON f.product_wid = p.row_wid
GROUP BY p.product_name
ORDER BY total_sales DESC
LIMIT 10
"""

print("Executing Report 2: Top Selling Products...")
df_top_products = run_query(query_2, database, athena_output_loc)

if df_top_products is not None:
    print(df_top_products)

## Report 3: Sales by Plan Type
We analyze sales distribution based on Customer Plan Types (Gold, Silver, etc.).
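*   **Join Path:** `fact_sales` -> `dim_customer` -> `dim_plan_type` if plan type is a separate lookup table, or just `fact_sales` -> `dim_customer` if it is a denormalized attribute on the customer dimension.
*   *Note: Based on the video, Plan Type might be an attribute in Customer or a separate lookup. The query below assumes it is an attribute of `dim_customer`.*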

In [None]:
# Assuming Plan Type is an attribute in dim_customer for this query
query_3 = """
SELECT 
    c.plan_type, 
    SUM(f.line_total) as total_sales
FROM fact_sales f
JOIN dim_customer c ON f.customer_wid = c.row_wid
WHERE c.active_flg = 'Y' -- Keep only sales linked to the current (active) customer row
GROUP BY c.plan_type
ORDER BY total_sales DESC
"""

print("Executing Report 3: Sales by Customer Plan Type...")
df_plan_sales = run_query(query_3, database, athena_output_loc)

if df_plan_sales is not None:
    print(df_plan_sales)
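
One thing to check with the `active_flg = 'Y'` filter: if `dim_customer` is maintained as SCD2 and `fact_sales.customer_wid` points at the customer version that was current when the sale was loaded, the filter drops sales tied to expired versions rather than re-attributing them to the current plan. The toy example below illustrates this in Pandas under exactly those assumptions. All rows are made up, and `customer_id` is a hypothetical natural-key column.

```python
import pandas as pd

# Hypothetical rows: customer C1 moved from Silver to Gold, so dim_customer
# holds an expired version (row_wid 1) and a current one (row_wid 2)
dim_customer = pd.DataFrame({
    "row_wid": [1, 2, 3],
    "customer_id": ["C1", "C1", "C2"],
    "plan_type": ["Silver", "Gold", "Silver"],
    "active_flg": ["N", "Y", "Y"],
})
fact_sales = pd.DataFrame({
    "customer_wid": [1, 2, 3],
    "line_total": [100.0, 50.0, 70.0],
})

joined = fact_sales.merge(dim_customer, left_on="customer_wid", right_on="row_wid")

# Same filter as query_3: only facts that point at a current customer row
current_only = (
    joined[joined["active_flg"] == "Y"].groupby("plan_type")["line_total"].sum()
)

# No filter: each sale counts under the plan the customer had at sale time
as_sold = joined.groupby("plan_type")["line_total"].sum()

print("Filtered total:", current_only.sum())
print("Unfiltered total:", as_sold.sum())
```

In this toy data the filtered report covers 120.0 of the 220.0 in sales. Whether that matches the intended meaning of the report depends on how the fact table was keyed in the earlier steps.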

## Conclusion
We have successfully:
1.  Ingested raw data from CSV/JSON into a Data Lake.
2.  Processed and modeled data using PySpark and Delta Lake.
3.  Served the data via AWS Athena for Analytics.

This completes the **Data Warehousing with PySpark** course.