### Set Up an Amazon Athena Database and AWS Glue Crawler

### Description:  
This notebook automates the setup of an Amazon Athena database and an AWS Glue crawler.

It creates an Amazon Athena database, configures an S3 bucket as the data source, and sets up an AWS Glue crawler to catalog the data. Once the crawler runs, it populates the AWS Glue Data Catalog with table metadata, so the data can be queried with Athena.

This setup is essential for performing serverless SQL queries on structured and semi-structured data stored in Amazon S3.

![Text to SQL](./glue.png)

In [1]:
import json
with open("../variables.json", "r") as f:
    variables = json.load(f)

variables

{'accountNumber': '791677101579',
 'regionName': 'us-west-2',
 'collectionArn': 'arn:aws:aoss:us-west-2:791677101579:collection/u99a2f111uq506nobq6l',
 'collectionId': 'u99a2f111uq506nobq6l',
 'vectorIndexName': 'ws-index-',
 'bedrockExecutionRoleArn': 'arn:aws:iam::791677101579:role/advanced-rag-workshop-bedrock_execution_role-us-west-2',
 's3Bucket': '791677101579-us-west-2-advanced-rag-workshop',
 'kbFixedChunk': '2OLAU6UCAW',
 'kbSemanticChunk': 'SCMPE1YU8Y',
 'kbHierarchicalChunk': 'UKZ63LEW5P',
 'kbCustomChunk': 'P55X5UTFYK',
 'sagemakerLLMEndpoint': 'endpoint-llama-3-2-3b-instruct-2025-04-22-19-37-32',
 'guardrail_id': 'a3z8dptcpo5h',
 'guardrail_version': '1'}

In [2]:
!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5


# Uploading File to S3

For this lab, we will use sample data files provided in the [AWS Big Data Blog](https://aws.amazon.com/blogs/big-data/joining-across-data-sources-on-amazon-quicksight/). The data consists of two tables, Order and Returns, which share a common key, Order Id.

In [3]:
# A helper function that converts each sheet of an Excel file in S3 to a CSV file and uploads it to S3.
import boto3
import pandas as pd
from io import BytesIO

def excel_to_csv(source_bucket, source_key, target_bucket, target_prefix):
    """
    Convert Excel file from S3 to CSV and save back to S3.
    
    Args:
        source_bucket (str): Source S3 bucket name
        source_key (str): Source file key (path to xlsx file)
        target_bucket (str): Target S3 bucket name
        target_prefix (str): Target prefix (folder) for CSV files
    """
    # Initialize S3 client
    s3_client = boto3.client('s3')
    # Read the Excel file from S3
    response = s3_client.get_object(Bucket=source_bucket, Key=source_key)

    excel_data = response['Body'].read()

    # Load Excel file into pandas
    excel_file = pd.ExcelFile(BytesIO(excel_data))

    # Process each sheet
    for sheet_name in excel_file.sheet_names:
        # Read the sheet into a DataFrame
        df = pd.read_excel(excel_file, sheet_name=sheet_name)
        # Normalize column names: trim first so stray edge spaces do not become underscores
        df.columns = df.columns.str.strip()
        df.columns = df.columns.str.replace(' ', '_')
        df.columns = df.columns.str.lower()
        print(df.head(10))

        # Generate target key for the CSV file
        target_key = f"{target_prefix}/{sheet_name}.csv"
        # pandas writes to s3:// URLs through fsspec, which needs the s3fs package installed
        df.to_csv(f"s3://{target_bucket}/{target_key}", index=True,
                  index_label=f"{sheet_name}_index")

        print(f"Successfully converted sheet '{sheet_name}' to CSV")
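
The column clean-up inside `excel_to_csv` can be isolated into a small pure function, which makes the order of operations easy to check without touching S3. The following is a minimal sketch; `normalize_column` is a hypothetical helper name that does not appear in the notebook. It strips before replacing spaces so that leading or trailing spaces do not turn into underscores.

```python
def normalize_column(name: str) -> str:
    """Trim edge whitespace, lowercase, and replace inner spaces with underscores."""
    # Strip first: replacing spaces before stripping would leave "_ship_mode_".
    return name.strip().lower().replace(" ", "_")


# Hyphens are left untouched, so "Sub-Category" keeps its hyphen.
print([normalize_column(c) for c in ["Order ID", " Ship Mode ", "Sub-Category"]])
```

Keeping this logic in one place also makes it simple to extend later, for example to replace hyphens if downstream SQL should avoid quoted identifiers.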


In [4]:
import boto3

order_table_s3_key = "artifacts/aws-blog-joining-across-quicksight/orders.xlsx"
returns_table_s3_key = "artifacts/aws-blog-joining-across-quicksight/returns.xlsx"
source_bucket_name = "aws-bigdata-blog"
target_bucket_name = variables['s3Bucket']  # The name of your S3 bucket

excel_to_csv(source_bucket_name, order_table_s3_key, target_bucket_name, "transactions/order")
excel_to_csv(source_bucket_name, returns_table_s3_key, target_bucket_name, "transactions/returns")

   row_id         order_id order_date  ship_date       ship_mode customer_id  \
0   32298   CA-2012-124891 2012-07-31 2012-07-31        Same Day    RH-19495   
1   26341    IN-2013-77878 2013-02-05 2013-02-07    Second Class    JR-16210   
2   25330    IN-2013-71249 2013-10-17 2013-10-18     First Class    CR-12730   
3   13524  ES-2013-1579342 2013-01-28 2013-01-30     First Class    KM-16375   
4   47221     SG-2013-4320 2013-11-05 2013-11-06        Same Day     RH-9495   
5   22732    IN-2013-42360 2013-06-28 2013-07-01    Second Class    JM-15655   
6   30570    IN-2011-81826 2011-11-07 2011-11-09     First Class    TS-21340   
7   31192    IN-2012-86369 2012-04-14 2012-04-18  Standard Class    MB-18085   
8   40155   CA-2014-135909 2014-10-14 2014-10-21  Standard Class    JW-15220   
9   40936   CA-2012-116638 2012-01-28 2012-01-31    Second Class    JH-15985   

      customer_name      segment           city            state  ...  \
0       Rick Hansen     Consumer  New York Cit

# IAM Role Creation and Policy Attachment

In [5]:
import boto3
import json
import time

def create_iam_role(role_name: str):
    iam_client = boto3.client('iam')
    
    try:
        # Check if the role already exists
        response = iam_client.get_role(RoleName=role_name)
        print(f"Role {role_name} already exists.")
        return response
    except iam_client.exceptions.NoSuchEntityException:
        # Trust policy for Glue service to assume the role
        trust_policy = {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "glue.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        }
        
        # Create the IAM role
        response = iam_client.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
            Description="IAM Role for AWS Glue to access S3 and Athena"
        )
        print(f"Role {role_name} created successfully.")
        
        # Wait for role to propagate through AWS systems
        print("Waiting for role to propagate...")
        time.sleep(10)
        return response
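
As an alternative to a fixed `time.sleep(10)`, boto3 offers an IAM waiter named `role_exists` that polls until the role is visible to IAM. The sketch below assumes an IAM client is passed in; `wait_for_role` and its parameter names are hypothetical. Note that this waiter only confirms the role exists in IAM. It does not guarantee that other services such as Glue can already assume it, so a short extra delay may still be needed.

```python
def wait_for_role(iam_client, role_name: str, delay_seconds: int = 2, max_attempts: int = 15) -> None:
    """Block until IAM reports that the role exists, or the waiter gives up."""
    # "role_exists" is a built-in boto3 IAM waiter; it repeatedly calls GetRole.
    waiter = iam_client.get_waiter("role_exists")
    waiter.wait(
        RoleName=role_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )


# Usage with a real client (assumes AWS credentials are configured):
#   import boto3
#   wait_for_role(boto3.client("iam"), "advanced-rag-workshop-glue-role")
```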


# Attaching Inline Policy to IAM Role

In [6]:
def attach_inline_policy_to_role(role_name: str, athena_db_name: str, path_to_the_folder: str, s3_bucket: str):
    iam_client = boto3.client('iam')
    region = boto3.session.Session().region_name
    account_id = boto3.client('sts').get_caller_identity()['Account']

    inline_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetTable",
                    "glue:GetTableVersion",
                    "glue:GetTableVersions",
                    "glue:GetDatabase",
                    "glue:CreateTable",
                    "glue:UpdateTable",
                    "glue:DeleteTable",
                    "glue:GetCrawler",
                    "glue:StartCrawler",
                    "glue:GetCrawlerMetrics"
                ],
                "Resource": [
                    f"arn:aws:glue:{region}:{account_id}:catalog",
                    f"arn:aws:glue:{region}:{account_id}:database/{athena_db_name}",
                    f"arn:aws:glue:{region}:{account_id}:table/{athena_db_name}/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject"
                ],
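                "Resource": [
                    # s3:ListBucket is evaluated against the bucket ARN, s3:GetObject against object ARNs
                    f"arn:aws:s3:::{s3_bucket}",
                    f"arn:aws:s3:::{s3_bucket}/{path_to_the_folder}/*"
                ]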
            },
            {
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryResults",
                    "athena:GetQueryExecution"
                ],
                "Resource": f"arn:aws:athena:{region}:{account_id}:workgroup/primary"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": f"arn:aws:logs:{region}:{account_id}:log-group:/aws/glue/*"
            }
        ]
    }
    
    # Attach the inline policy to the role
    try:
        iam_client.put_role_policy(
            RoleName=role_name,
            PolicyName='GlueCrawlerPolicy',
            PolicyDocument=json.dumps(inline_policy)
        )
        print(f"Inline policy attached to role {role_name} successfully.")
        
        # Also attach AWS managed policy for Glue
        iam_client.attach_role_policy(
            RoleName=role_name,
            PolicyArn='arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole'
        )
        print("Attached AWSGlueServiceRole managed policy")
        
        # Wait for policy to propagate
        print("Waiting for policies to propagate...")
        time.sleep(10)
    except Exception as e:
        print(f"Error attaching policies to role {role_name}: {e}")
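
The resource ARNs in a policy like this one embed the Region and account ID, and S3 actions differ in what they are evaluated against: `s3:ListBucket` applies to the bucket ARN, while `s3:GetObject` applies to object ARNs. The sketch below builds these strings with small pure functions so they can be inspected before being placed in a policy document. The function names and the placeholder account ID `123456789012` are illustrative assumptions, not part of the notebook.

```python
from typing import List


def glue_resource_arns(region: str, account_id: str, database: str) -> List[str]:
    """Return catalog, database, and table ARNs for one Glue database."""
    base = f"arn:aws:glue:{region}:{account_id}"
    return [
        f"{base}:catalog",
        f"{base}:database/{database}",
        f"{base}:table/{database}/*",
    ]


def s3_read_arns(bucket: str, prefix: str) -> List[str]:
    """Return the bucket ARN (for ListBucket) and the object ARN pattern (for GetObject)."""
    return [
        f"arn:aws:s3:::{bucket}",
        f"arn:aws:s3:::{bucket}/{prefix}/*",
    ]


# Placeholder account ID and bucket name, for illustration only.
for arn in glue_resource_arns("us-west-2", "123456789012", "retail"):
    print(arn)
for arn in s3_read_arns("example-bucket", "transactions"):
    print(arn)
```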


# Creating Amazon Athena Database

In [7]:
def create_athena_database(athena_db_name: str, s3_bucket: str):
    athena_client = boto3.client('athena', region_name=variables['regionName'])
    
    # First create the results directory if it doesn't exist
    s3_client = boto3.client('s3', region_name=variables['regionName'])
    try:
        s3_client.put_object(
            Bucket=s3_bucket,
            Key='athena-query-results/',
            Body=''
        )
        print(f"Created Athena results directory in bucket {s3_bucket}")
    except Exception as e:
        print(f"Note: {e}")

    try:
        # Create Athena database
        query = f"CREATE DATABASE IF NOT EXISTS {athena_db_name}"
        response = athena_client.start_query_execution(
            QueryString=query,
            ResultConfiguration={
                'OutputLocation': f's3://{s3_bucket}/athena-query-results/'
            }
        )
        
        # Wait for query execution
        query_id = response['QueryExecutionId']
        print(f"Started Athena database creation, execution ID: {query_id}")
        
        # Wait for completion
        status = 'RUNNING'
        while status in ['RUNNING', 'QUEUED']:
            time.sleep(5)
            result = athena_client.get_query_execution(QueryExecutionId=query_id)
            status = result['QueryExecution']['Status']['State']
        
        if status == 'SUCCEEDED':
            print(f"Athena database {athena_db_name} created successfully")
        else:
            print(f"Athena database creation failed with status: {status}")
        
        return response
    except Exception as e:
        print(f"Error creating Athena database: {e}")
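
The polling loop in `create_athena_database` waits for as long as the query stays in `QUEUED` or `RUNNING`. A variant with a deadline, which also surfaces the failure reason Athena reports, could look like the sketch below. `wait_for_query` and its parameters are hypothetical names; the Athena client is passed in, so the function can be exercised with a stub.

```python
import time
from typing import Tuple

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(athena_client, query_id: str,
                   poll_seconds: float = 2.0,
                   timeout_seconds: float = 300.0) -> Tuple[str, str]:
    """Poll get_query_execution until a terminal state; return (state, reason)."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in TERMINAL_STATES:
            # StateChangeReason is only present when Athena has something to report.
            return status["State"], status.get("StateChangeReason", "")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Query {query_id} still {status['State']} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage with a real client (assumes AWS credentials are configured):
#   state, reason = wait_for_query(boto3.client("athena"), query_id)
```

Returning the reason alongside the state makes a `FAILED` result easier to diagnose than a bare status string.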


# Creating and Starting Glue Crawler

In [8]:
def create_glue_crawler(crawler_name: str, s3_input_bucket: str, path_to_the_folder: str, role_name: str, athena_db_name: str):
    glue_client = boto3.client('glue', region_name=variables['regionName'])
    
    try:
        # Check if the crawler already exists
        glue_client.get_crawler(Name=crawler_name)
        print(f"Crawler {crawler_name} already exists.")
    except glue_client.exceptions.EntityNotFoundException:
        try:
            # Create Glue Crawler with dynamic values
            response = glue_client.create_crawler(
                Name=crawler_name,
                Role=role_name,
                DatabaseName=athena_db_name,
                Targets={
                    'S3Targets': [
                        {
                            'Path': f's3://{s3_input_bucket}/{path_to_the_folder}',
                            'Exclusions': []
                        }
                    ]
                },
                TablePrefix='retail_',
                Description='Crawler for retail transactions data'
            )
            print(f"Crawler {crawler_name} created successfully.")
            return response
        except Exception as e:
            print(f"Error creating crawler: {e}")
            # Print role info for debugging
            iam_client = boto3.client('iam')
            try:
                role_info = iam_client.get_role(RoleName=role_name)
                print(f"Role ARN: {role_info['Role']['Arn']}")
            except Exception as re:
                print(f"Error getting role details: {re}")
    
    return None


def start_glue_crawler(crawler_name: str):
    glue_client = boto3.client('glue', region_name=variables['regionName'])

    try:
        # Start the Glue Crawler execution
        glue_client.start_crawler(Name=crawler_name)
        print(f"Crawler {crawler_name} started successfully.")
    except glue_client.exceptions.CrawlerRunningException:
        print(f"Crawler {crawler_name} is already running.")
    except Exception as e:
        print(f"Error starting crawler: {e}")
        return

    # Wait for the crawler to finish
    print("Waiting for crawler to complete...")
    while True:
        try:
            response = glue_client.get_crawler(Name=crawler_name)
            status = response['Crawler']['State']
            if status == 'READY':
                print(f'Glue Crawler {crawler_name} has completed successfully.')
                break
            elif status == 'RUNNING':
                print(f'Glue Crawler {crawler_name} is still running...')
                time.sleep(10)  # Wait for 10 seconds before checking again
            else:
                print(f'Glue Crawler {crawler_name} status: {status}')
                time.sleep(10)
        except Exception as e:
            print(f"Error checking crawler status: {e}")
            time.sleep(10)
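
A crawler in the `READY` state is idle, which does not by itself mean that the last run succeeded. The `get_crawler` response also carries a `LastCrawl` entry with the status and any error message of the most recent run. The sketch below reads it; `crawl_outcome` is a hypothetical helper name, and the Glue client is passed in as an argument.

```python
from typing import Tuple


def crawl_outcome(glue_client, crawler_name: str) -> Tuple[str, str]:
    """Return (status, error_message) for the crawler's most recent run."""
    crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
    # LastCrawl is absent until the crawler has run at least once.
    last_crawl = crawler.get("LastCrawl", {})
    return last_crawl.get("Status", "UNKNOWN"), last_crawl.get("ErrorMessage", "")


# Usage with a real client (assumes AWS credentials are configured):
#   status, error = crawl_outcome(boto3.client("glue"), "advanced-rag-workshop-glue-crawler")
#   if status != "SUCCEEDED":
#       print(f"Last crawl ended with {status}: {error}")
```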


# Run the Functions Above to Create the Database and Tables

In [9]:
# Get the S3 bucket name from variables
try:
    s3_bucket = variables['s3Bucket']
except (NameError, KeyError):
    # If variables dictionary doesn't exist or doesn't have s3Bucket
    s3_client = boto3.client('s3')
    response = s3_client.list_buckets()
    buckets = [bucket['Name'] for bucket in response['Buckets']]
    print(f"Available buckets: {buckets}")
    s3_bucket = input("Please enter your S3 bucket name: ")

# Inputs from user (or dynamically provided)
role_name = "advanced-rag-workshop-glue-role"
crawler_name = "advanced-rag-workshop-glue-crawler"
path_to_the_folder = "transactions"
athena_db_name = "retail"

print(f"Using S3 bucket: {s3_bucket}")

# Create IAM role
role_response = create_iam_role(role_name)

# Attach inline policy to IAM role
attach_inline_policy_to_role(role_name, athena_db_name, path_to_the_folder, s3_bucket)

# Create Athena Database
create_athena_database(athena_db_name, s3_bucket)

# Create Glue crawler
create_glue_crawler(crawler_name, s3_bucket, path_to_the_folder, role_name, athena_db_name)

# Start Glue crawler
start_glue_crawler(crawler_name)

Using S3 bucket: 791677101579-us-west-2-advanced-rag-workshop
Role advanced-rag-workshop-glue-role created successfully.
Waiting for role to propagate...
Inline policy attached to role advanced-rag-workshop-glue-role successfully.
Attached AWSGlueServiceRole managed policy
Waiting for policies to propagate...
Created Athena results directory in bucket 791677101579-us-west-2-advanced-rag-workshop
Started Athena database creation, execution ID: 4322a24d-cf28-4ef0-9b27-aa82fb23df2e
Athena database retail created successfully
Crawler advanced-rag-workshop-glue-crawler created successfully.
Crawler advanced-rag-workshop-glue-crawler started successfully.
Waiting for crawler to complete...
Glue Crawler advanced-rag-workshop-glue-crawler is still running...
Glue Crawler advanced-rag-workshop-glue-crawler is still running...
Glue Crawler advanced-rag-workshop-glue-crawler is still running...
Glue Crawler advanced-rag-workshop-glue-crawler is still running...
Glue Crawler advanced-rag-workshop-

## Query Athena and Display Results as a Table

#### The previous step created two tables.
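
Before querying, it can be useful to confirm which tables the crawler registered in the Glue Data Catalog. The sketch below lists table names with the Glue `get_tables` call and follows `NextToken` for pagination. `list_table_names` is a hypothetical helper name, and the client is passed in as an argument.

```python
from typing import List


def list_table_names(glue_client, database: str) -> List[str]:
    """Return the names of all tables in a Glue Data Catalog database."""
    names: List[str] = []
    kwargs = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        names.extend(table["Name"] for table in page.get("TableList", []))
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token


# Usage with a real client (assumes AWS credentials are configured):
#   print(list_table_names(boto3.client("glue", region_name="us-west-2"), "retail"))
```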

### 1. Order table

In [10]:
import boto3
import pandas as pd
import time

# AWS Configuration
ATHENA_DATABASE = athena_db_name
ATHENA_TABLE = "retail_order"
S3_OUTPUT_LOCATION = f"s3://{variables['s3Bucket']}/athena-query-results/"
AWS_REGION = variables['regionName']  # Use the same Region as the database and crawler

# Initialize Athena client
athena_client = boto3.client("athena", region_name=AWS_REGION)

# Define Query
query = f"SELECT * FROM {ATHENA_TABLE} LIMIT 10;"  # Modify query as needed

# Start Query Execution
response = athena_client.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": ATHENA_DATABASE},
    ResultConfiguration={"OutputLocation": S3_OUTPUT_LOCATION},
)

# Get Query Execution ID
query_execution_id = response["QueryExecutionId"]

# Wait for Query to Complete
while True:
    status = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ["SUCCEEDED", "FAILED", "CANCELLED"]:
        break
    time.sleep(2)  # Wait before checking again

# Check if Query Succeeded
if state == "SUCCEEDED":
    # Fetch Results
    results = athena_client.get_query_results(QueryExecutionId=query_execution_id)

    # Extract Column Names
    columns = [col["Label"] for col in results["ResultSet"]["ResultSetMetadata"]["ColumnInfo"]]

    # Extract Row Data
    rows = [
        [col.get("VarCharValue", "") for col in row["Data"]]
        for row in results["ResultSet"]["Rows"][1:]  # Skip header row
    ]

    # Convert to DataFrame
    df = pd.DataFrame(rows, columns=columns)

    # Display as Table
    display(df)

else:
    print(f"Query failed with state: {state}")
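
The parsing step above can be factored into a pure function that turns a `get_query_results` response into a list of dictionaries. This is a minimal sketch; `result_set_to_records` is a hypothetical name, and the sample response is hand-written to mirror the response shape (the second row deliberately omits a value to show how a NULL cell arrives). Athena returns every value as a string in `VarCharValue`, so numeric columns need an explicit conversion afterwards.

```python
from typing import Dict, List, Optional


def result_set_to_records(result: dict) -> List[Dict[str, Optional[str]]]:
    """Convert one page of Athena query results into a list of row dictionaries."""
    result_set = result["ResultSet"]
    columns = [col["Label"] for col in result_set["ResultSetMetadata"]["ColumnInfo"]]
    records = []
    # For SELECT queries the first row of the first page repeats the column names.
    for row in result_set["Rows"][1:]:
        values = [cell.get("VarCharValue") for cell in row["Data"]]  # None for NULL cells
        records.append(dict(zip(columns, values)))
    return records


# Hand-written sample in the shape of a get_query_results response.
sample = {
    "ResultSet": {
        "ResultSetMetadata": {"ColumnInfo": [{"Label": "order_id"}, {"Label": "returned"}]},
        "Rows": [
            {"Data": [{"VarCharValue": "order_id"}, {"VarCharValue": "returned"}]},
            {"Data": [{"VarCharValue": "MX-2013-168137"}, {"VarCharValue": "Yes"}]},
            {"Data": [{"VarCharValue": "US-2011-165316"}, {}]},
        ],
    }
}
print(result_set_to_records(sample))
```

A list of dictionaries can be passed straight to `pd.DataFrame(records)` if a table view is needed.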


Unnamed: 0,orders_index,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,city,...,product_id,category,sub-category,product_name,sales,quantity,discount,profit,shipping_cost,order_priority
0,0,32298,CA-2012-124891,2012-07-31,2012-07-31,Same Day,RH-19495,Rick Hansen,Consumer,New York City,...,TEC-AC-10003033,Technology,Accessories,Plantronics CS510 - Over-the-Head monaural Wir...,2309.65,7,0.0,762.1844999999998,933.57,Critical
1,1,26341,IN-2013-77878,2013-02-05,2013-02-07,Second Class,JR-16210,Justin Ritter,Corporate,Wollongong,...,FUR-CH-10003950,Furniture,Chairs,"""Novimex Executive Leather Armchair",,3709,9.0,0.1,-288.765,923.63
2,2,25330,IN-2013-71249,2013-10-17,2013-10-18,First Class,CR-12730,Craig Reiter,Consumer,Brisbane,...,TEC-PH-10004664,Technology,Phones,"""Nokia Smart Phone",,5175,9.0,0.1,919.9709999999995,915.49
3,3,13524,ES-2013-1579342,2013-01-28,2013-01-30,First Class,KM-16375,Katherine Murray,Home Office,Berlin,...,TEC-PH-10004583,Technology,Phones,"""Motorola Smart Phone",,2892,5.0,0.1,-96.54000000000003,910.16
4,4,47221,SG-2013-4320,2013-11-05,2013-11-06,Same Day,RH-9495,Rick Hansen,Consumer,Dakar,...,TEC-SHA-10000501,Technology,Copiers,"""Sharp Wireless Fax",,2832,8.0,0.0,311.52,903.04
5,5,22732,IN-2013-42360,2013-06-28,2013-07-01,Second Class,JM-15655,Jim Mitchum,Corporate,Sydney,...,TEC-PH-10000030,Technology,Phones,"""Samsung Smart Phone",,2862,5.0,0.1,763.2750000000001,897.35
6,6,30570,IN-2011-81826,2011-11-07,2011-11-09,First Class,TS-21340,Toby Swindell,Consumer,Porirua,...,FUR-CH-10004050,Furniture,Chairs,"""Novimex Executive Leather Armchair",,1822,4.0,0.0,564.84,894.77
7,7,31192,IN-2012-86369,2012-04-14,2012-04-18,Standard Class,MB-18085,Mick Brown,Consumer,Hamilton,...,FUR-TA-10002958,Furniture,Tables,"""Chromcraft Conference Table",,5244,6.0,0.0,996.48,878.38
8,8,40155,CA-2014-135909,2014-10-14,2014-10-21,Standard Class,JW-15220,Jane Waco,Corporate,Sacramento,...,OFF-BI-10003527,Office Supplies,Binders,Fellowes PB500 Electric Punch Plastic Comb Bin...,5083.96,5,0.2,1906.485,867.69,Low
9,9,40936,CA-2012-116638,2012-01-28,2012-01-31,Second Class,JH-15985,Joseph Holt,Consumer,Concord,...,FUR-TA-10000198,Furniture,Tables,Chromcraft Bull-Nose Wood Oval Conference Tabl...,4297.644,13,0.4,-1862.3124,865.74,Critical


### 2. Returns table

In [11]:
import boto3
import pandas as pd
import time

# AWS Configuration
ATHENA_DATABASE = athena_db_name
ATHENA_TABLE = "retail_returns"
S3_OUTPUT_LOCATION = f"s3://{variables['s3Bucket']}/athena-query-results/"
AWS_REGION = variables['regionName']  # Use the same Region as the database and crawler

# Initialize Athena client
athena_client = boto3.client("athena", region_name=AWS_REGION)

# Define Query
query = f"SELECT * FROM {ATHENA_TABLE} LIMIT 10;"  # Modify query as needed

# Start Query Execution
response = athena_client.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": ATHENA_DATABASE},
    ResultConfiguration={"OutputLocation": S3_OUTPUT_LOCATION},
)

# Get Query Execution ID
query_execution_id = response["QueryExecutionId"]

# Wait for Query to Complete
while True:
    status = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ["SUCCEEDED", "FAILED", "CANCELLED"]:
        break
    time.sleep(2)  # Wait before checking again

# Check if Query Succeeded
if state == "SUCCEEDED":
    # Fetch Results
    results = athena_client.get_query_results(QueryExecutionId=query_execution_id)

    # Extract Column Names
    columns = [col["Label"] for col in results["ResultSet"]["ResultSetMetadata"]["ColumnInfo"]]

    # Extract Row Data
    rows = [
        [col.get("VarCharValue", "") for col in row["Data"]]
        for row in results["ResultSet"]["Rows"][1:]  # Skip header row
    ]

    # Convert to DataFrame
    df = pd.DataFrame(rows, columns=columns)

    # Display as Table
    display(df)

else:
    print(f"Query failed with state: {state}")


Unnamed: 0,returns_index,returned,order_id,market
0,0,Yes,MX-2013-168137,LATAM
1,1,Yes,US-2011-165316,LATAM
2,2,Yes,ES-2013-1525878,EU
3,3,Yes,CA-2013-118311,United States
4,4,Yes,ES-2011-1276768,EU
5,5,Yes,MX-2013-131247,LATAM
6,6,Yes,ID-2011-20975,APAC
7,7,Yes,IN-2014-58460,APAC
8,8,Yes,ES-2011-3028321,EU
9,9,Yes,MX-2014-148285,LATAM
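
Because both tables carry an `order_id` column, they can be joined in a single Athena query. The sketch below only builds the SQL string; `build_returns_join_query` is a hypothetical helper, and the selected columns are taken from the outputs shown above. The resulting string could be passed as `QueryString` to `start_query_execution` in the same way as the queries in the previous cells.

```python
def build_returns_join_query(order_table: str = "retail_order",
                             returns_table: str = "retail_returns",
                             limit: int = 10) -> str:
    """Build a SQL query that lists returned orders with customer details."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT o.order_id, o.customer_name, r.returned, r.market "
        f"FROM {order_table} AS o "
        f"JOIN {returns_table} AS r ON o.order_id = r.order_id "
        f"LIMIT {limit}"
    )


print(build_returns_join_query())
```

The table names are interpolated directly into the SQL, so this helper should only be called with trusted, fixed identifiers rather than user input.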
