# AWS Glue - Comprehensive Guide for Data Engineers

This notebook provides a complete walkthrough of AWS Glue, Amazon's serverless ETL (Extract, Transform, Load) service.

---

## Table of Contents

1. **Introduction & Core Concepts** - What is Glue, architecture, components
2. **Setup & Prerequisites** - IAM roles, boto3 configuration
3. **Data Catalog Operations** - Databases, tables, metadata
4. **Crawlers** - Schema discovery and management
5. **ETL Jobs** - Spark and Python Shell jobs
6. **Advanced Topics** - Job bookmarks, cost optimization

---

## PHASE 1: INTRODUCTION & CORE CONCEPTS

### What is AWS Glue?

AWS Glue is a **fully managed, serverless ETL service** that makes it easy to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning, and application development.

**Key Characteristics:**
- **Serverless**: No infrastructure to manage
- **Pay-per-use**: Charged only for resources consumed during job runs
- **Scalable**: Automatically scales based on workload
- **Integrated**: Works seamlessly with S3, RDS, Redshift, Athena

---

### AWS Glue vs Traditional ETL

| Aspect | Traditional ETL | AWS Glue |
|--------|-----------------|----------|
| Infrastructure | Self-managed servers | Serverless |
| Scaling | Manual capacity planning | Automatic |
| Cost Model | Fixed infrastructure costs | Pay-per-use |
| Setup Time | Days to weeks | Minutes |
| Schema Management | Manual | Automated (Crawlers) |

### AWS Glue Architecture Overview

```
+-----------------------------------------------------------------------------------+
|                              AWS GLUE ARCHITECTURE                                |
+-----------------------------------------------------------------------------------+
|                                                                                   |
|   DATA SOURCES                      AWS GLUE                      DATA TARGETS    |
|   +-------------+                                                +-------------+  |
|   |    S3       |                                                |    S3       |  |
|   | (Data Lake) |---+                                        +---|  (Parquet)  |  |
|   +-------------+   |                                        |   +-------------+  |
|                     |    +---------------------------+       |                    |
|   +-------------+   |    |                           |       |   +-------------+  |
|   |    RDS      |---+--->|      ETL JOBS             |-------+-->|  Redshift   |  |
|   | (MySQL/PG)  |   |    |   (Spark / Python)        |       |   | (Warehouse) |  |
|   +-------------+   |    |                           |       |   +-------------+  |
|                     |    +---------------------------+       |                    |
|   +-------------+   |              |                         |   +-------------+  |
|   |  DynamoDB   |---+              |                         +-->|  Athena     |  |
|   +-------------+                  v                             | (Analytics) |  |
|                     +---------------------------+                +-------------+  |
|                     |      DATA CATALOG         |                                 |
|                     |  (Metadata Repository)    |<----+                           |
|                     +---------------------------+     |                           |
|                                    ^                  |                           |
|                     +---------------------------+     |                           |
|                     |        CRAWLERS           |-----+                           |
|                     |   (Schema Discovery)      |                                 |
|                     +---------------------------+                                 |
+-----------------------------------------------------------------------------------+
```

### Key Components of AWS Glue

#### 1. Data Catalog
The **central metadata repository** that stores table definitions, job definitions, and other control information.

```
+------------------------------------------+
|            DATA CATALOG                  |
+------------------------------------------+
|  +----------------+  +----------------+  |
|  |   Database A   |  |   Database B   |  |
|  +----------------+  +----------------+  |
|  | - Table 1      |  | - Table 1      |  |
|  |   - Columns    |  |   - Columns    |  |
|  |   - Location   |  |   - Location   |  |
|  |   - Format     |  |   - Format     |  |
|  | - Table 2      |  | - Table 2      |  |
|  +----------------+  +----------------+  |
+------------------------------------------+
```

**Key Points:**
- Compatible with Apache Hive Metastore
- Accessible from Athena, EMR, Redshift Spectrum
- Stores schema, location, and properties of data

---

#### 2. Crawlers
Automated programs that **scan your data sources** and populate the Data Catalog.

```
DATA SOURCE          CRAWLER              DATA CATALOG
+--------+        +----------+          +------------+
|   S3   | -----> | Analyze  | -------> |   Table    |
| (CSV)  |        | Schema   |          | Definition |
+--------+        +----------+          +------------+
                       |
                       v
               Infers columns,
               data types,
               partitions
```

---

#### 3. ETL Jobs
The actual **data transformation workloads**.

| Type | Engine | Use Case | Min Resources |
|------|--------|----------|---------------|
| Spark | Apache Spark | Large-scale data | 2 DPUs |
| Python Shell | Python | Small datasets | 0.0625 DPU |
| Streaming | Spark Streaming | Real-time | 2 DPUs |

---

#### 4. Workflows
Orchestrate multiple crawlers and jobs into a pipeline.

```
+--------+     +--------+     +--------+     +--------+
| Start  |---->|Crawler |---->|  Job   |---->|  Job   |
+--------+     +--------+     +--------+     +--------+
```
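
A pipeline like the one above is usually wired together with triggers attached to a workflow. The sketch below only builds the keyword arguments that boto3's `create_trigger` call accepts; it does not call AWS. The helper `build_pipeline_triggers` and all resource names (`demo-workflow`, `clean-job`, `load-job`) are hypothetical and chosen for illustration.

```python
def build_pipeline_triggers(workflow_name, crawler_name, job_names):
    """Build create_trigger keyword arguments for: start -> crawler -> job -> job."""
    # First trigger: started manually, kicks off the crawler
    triggers = [{
        'Name': f'{workflow_name}-start',
        'WorkflowName': workflow_name,
        'Type': 'ON_DEMAND',
        'Actions': [{'CrawlerName': crawler_name}],
    }]

    # Condition that must hold before the next step starts
    condition = {
        'LogicalOperator': 'EQUALS',
        'CrawlerName': crawler_name,
        'CrawlState': 'SUCCEEDED',
    }

    for job_name in job_names:
        triggers.append({
            'Name': f'{workflow_name}-run-{job_name}',
            'WorkflowName': workflow_name,
            'Type': 'CONDITIONAL',
            'StartOnCreation': True,
            'Predicate': {'Conditions': [condition]},
            'Actions': [{'JobName': job_name}],
        })
        # The following job waits for this one to succeed
        condition = {
            'LogicalOperator': 'EQUALS',
            'JobName': job_name,
            'State': 'SUCCEEDED',
        }
    return triggers


pipeline = build_pipeline_triggers('demo-workflow', 'my-crawler', ['clean-job', 'load-job'])
for trigger in pipeline:
    print(trigger['Name'], '->', trigger['Actions'])
```

Assuming the workflow has already been created with `glue_client.create_workflow(Name='demo-workflow')`, each dictionary could then be passed as `glue_client.create_trigger(**trigger)`.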

### When to Use AWS Glue

**Good Fit:**
- Building data lakes on S3
- Cataloging data across multiple sources
- Batch ETL processing (hourly, daily)
- Schema discovery and management
- Data preparation for Athena/Redshift

**Consider Alternatives When:**
- Real-time with sub-second latency (use Kinesis)
- Simple file transfers (use DataSync)
- Very small datasets (Lambda might be cheaper)

---

### Cost Model

| Component | Pricing (approximate; varies by Region) |
|-----------|--------|
| ETL Jobs | $0.44 per DPU-hour |
| Crawlers | $0.44 per DPU-hour |
| Data Catalog | Free (first 1M objects stored) |

**1 DPU = 4 vCPUs + 16 GB memory**
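
To make the DPU-hour model concrete, here is a rough estimator. It is a sketch that assumes the $0.44 rate from the table and ignores any minimum billing duration; the function name `estimate_glue_cost` is made up for this example.

```python
def estimate_glue_cost(dpus, runtime_minutes, rate_per_dpu_hour=0.44):
    """Estimate the cost of one run in USD: DPUs x hours x hourly rate."""
    hours = runtime_minutes / 60
    return round(dpus * hours * rate_per_dpu_hour, 6)


# Compare the two minimum configurations for a 30-minute run
spark_cost = estimate_glue_cost(dpus=2, runtime_minutes=30)
shell_cost = estimate_glue_cost(dpus=0.0625, runtime_minutes=30)

print(f"Spark (2 DPUs, 30 min): ${spark_cost}")
print(f"Python Shell (0.0625 DPU, 30 min): ${shell_cost}")
```

Check the current AWS pricing page for your Region before relying on these numbers.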

## PHASE 2: SETUP & PREREQUISITES

### IAM Role Requirements

AWS Glue requires an IAM role with:

```
+------------------------------------------+
|          GLUE IAM ROLE                   |
+------------------------------------------+
|  Trust Policy:                           |
|  - glue.amazonaws.com can assume role    |
|                                          |
|  Permissions:                            |
|  - AWSGlueServiceRole (managed policy)   |
|  - S3 access to your data buckets        |
|  - CloudWatch Logs for logging           |
+------------------------------------------+
```
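
The trust policy in the diagram can be written out as a standard IAM policy document. The snippet below only builds the JSON; the variable names are illustrative.

```python
import json

# Trust policy: allows the Glue service to assume the role
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# AWS-managed policy referenced in the diagram above
GLUE_SERVICE_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"

trust_policy_json = json.dumps(GLUE_TRUST_POLICY, indent=2)
print(trust_policy_json)
```

With an IAM client, the JSON string would typically be supplied as `AssumeRolePolicyDocument` to `create_role`, and the managed policy ARN passed to `attach_role_policy`. S3 and CloudWatch Logs permissions still need to be granted separately for your own buckets.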

In [1]:
import boto3
from botocore.exceptions import ClientError
from dotenv import load_dotenv
import os
import json
import time
import re

load_dotenv()

True

In [2]:
# Environment Configuration
ACCESS_KEY = os.getenv("AWS_ACCESS_KEY")
SECRET_KEY = os.getenv("AWS_SECRET_KEY")
AWS_REGION = os.getenv("AWS_REGION")
BUCKET_NAME = os.getenv("AWS_BUCKET_NAME")
GLUE_ROLE_ARN = os.getenv("GLUE_ROLE_ARN")  # Add to your .env file

print(f"Region: {AWS_REGION}")
print(f"Bucket: {BUCKET_NAME}")

Region: us-east-2
Bucket: real-learn-s3


In [3]:
# Initialize Glue Client
glue_client = boto3.client(
    'glue',
    region_name=AWS_REGION,
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
)

# Helper function for security
def redact_account_id(arn):
    """Redact AWS account ID from ARN"""
    return re.sub(r':\d{12}:', ':************:', str(arn))

print("Glue client initialized")

Glue client initialized
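
A quick check of the `redact_account_id` helper, repeated here so the snippet runs on its own. The ARN is a made-up example, not a real account.

```python
import re


def redact_account_id(arn):
    """Redact the 12-digit AWS account ID from an ARN (same helper as above)."""
    return re.sub(r':\d{12}:', ':************:', str(arn))


example_arn = 'arn:aws:iam::123456789012:role/MyGlueRole'  # made-up ARN
redacted = redact_account_id(example_arn)
print(redacted)
```

Wrapping ARNs in this helper before printing keeps account IDs out of shared notebook output.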


## PHASE 3: DATA CATALOG OPERATIONS

### Hierarchy

```
Data Catalog
    +-- Database (logical grouping)
    |       +-- Table (metadata definition)
    |       |       +-- Columns (name, type)
    |       |       +-- Location (S3 path)
    |       |       +-- Format (CSV, Parquet)
    |       |       +-- Partitions (optional)
    |       +-- Table
    +-- Database
            +-- Table
```

---

### Function 1: Create Database

In [4]:
def create_glue_database(database_name, description=''):
    """
    Create a database in the AWS Glue Data Catalog
    
    Args:
        database_name (str): Name (lowercase, no spaces)
        description (str): Optional description
    
    Returns:
        bool: True if created successfully
    """
    try:
        print(f"Creating Glue database '{database_name}'...")
        
        glue_client.create_database(
            DatabaseInput={
                'Name': database_name,
                'Description': description
            }
        )
        
        print(f"SUCCESS: Database '{database_name}' created")
        return True
        
    except ClientError as e:
        if e.response['Error']['Code'] == 'AlreadyExistsException':
            print(f"Database '{database_name}' already exists")
            return True
        print(f"ERROR: {e}")
        return False

# Test
create_glue_database('data_engineering_db', 'Learning database')

Creating Glue database 'data_engineering_db'...
Database 'data_engineering_db' already exists


True

### Function 2: List Databases

In [5]:
def list_glue_databases():
    """
    List all databases in the Glue Data Catalog
    
    Returns:
        list: Database names
    """
    try:
        print("Listing Glue databases...")
        print("-" * 60)
        
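        # Paginate so catalogs with more than one page of results are fully listed
        paginator = glue_client.get_paginator('get_databases')
        databases = [db for page in paginator.paginate()
                     for db in page.get('DatabaseList', [])]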
        
        if not databases:
            print("No databases found")
            return []
        
        print(f"Found {len(databases)} database(s):\n")
        
        for db in databases:
            print(f"Database: {db['Name']}")
            print(f"  Description: {db.get('Description', 'N/A')}")
            print()
        
        return [db['Name'] for db in databases]
        
    except ClientError as e:
        print(f"ERROR: {e}")
        return []

# Test
list_glue_databases()

Listing Glue databases...
------------------------------------------------------------
Found 2 database(s):

Database: data_engineering_db
  Description: Learning database

Database: glue_db
  Description: A glue databse



['data_engineering_db', 'glue_db']

### Function 3: Create Table

In [6]:
def create_glue_table(database_name, table_name, columns, s3_location, 
                      input_format='csv', description=''):
    """
    Create a table in the Glue Data Catalog
    
    Args:
        database_name (str): Target database
        table_name (str): Table name
        columns (list): [{'Name': 'col1', 'Type': 'string'}, ...]
        s3_location (str): S3 path to data
        input_format (str): 'csv', 'json', or 'parquet'
        description (str): Optional table description
    
    Common Glue Types: string, int, bigint, double, boolean, date, timestamp
    """
    
    format_configs = {
        'csv': {
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
                'Parameters': {'field.delim': ',', 'skip.header.line.count': '1'}
            }
        },
        'json': {
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.openx.data.jsonserde.JsonSerDe',
                'Parameters': {}
            }
        },
        'parquet': {
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe',
                'Parameters': {}
            }
        }
    }
    
    config = format_configs.get(input_format)
    if not config:
        print(f"ERROR: Unsupported format '{input_format}'")
        return False
    
    try:
        print(f"Creating table '{table_name}' in '{database_name}'...")
        
        glue_client.create_table(
            DatabaseName=database_name,
            TableInput={
                'Name': table_name,
                'Description': description,
                'StorageDescriptor': {
                    'Columns': columns,
                    'Location': s3_location,
                    'InputFormat': config['InputFormat'],
                    'OutputFormat': config['OutputFormat'],
                    'SerdeInfo': config['SerdeInfo']
                },
                'TableType': 'EXTERNAL_TABLE'
            }
        )
        
        print(f"SUCCESS: Table '{table_name}' created")
        return True
        
    except ClientError as e:
        if e.response['Error']['Code'] == 'AlreadyExistsException':
            print(f"Table '{table_name}' already exists")
            return True
        print(f"ERROR: {e}")
        return False

# Example
sample_columns = [
    {'Name': 'id', 'Type': 'int'},
    {'Name': 'name', 'Type': 'string'},
    {'Name': 'created_at', 'Type': 'timestamp'}
]
create_glue_table('data_engineering_db', 'users', sample_columns, f's3://{BUCKET_NAME}/data/users/', 'csv')

Creating table 'users' in 'data_engineering_db'...
Table 'users' already exists


True
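
Writing the `columns` list by hand gets tedious for wide tables. A small helper can build it from a plain mapping. This is a sketch: `build_columns` is a hypothetical helper, and `VALID_TYPES` only covers the common types listed in the docstring above, not every type Glue supports.

```python
# Only the common types from the docstring above; Glue supports more
VALID_TYPES = {'string', 'int', 'bigint', 'double', 'boolean', 'date', 'timestamp'}


def build_columns(schema):
    """Convert {'column_name': 'type'} into the list-of-dicts shape Glue expects."""
    columns = []
    for name, col_type in schema.items():
        if col_type not in VALID_TYPES:
            raise ValueError(f"Unsupported type '{col_type}' for column '{name}'")
        columns.append({'Name': name, 'Type': col_type})
    return columns


user_columns = build_columns({'id': 'int', 'name': 'string', 'created_at': 'timestamp'})
print(user_columns)
```

The result has the same shape as `sample_columns` and can be passed straight to `create_glue_table`.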

### Function 4: List Tables

In [14]:
def list_glue_tables(database_name):
    """
    List all tables in a Glue database
    """
    try:
        print(f"Listing tables in '{database_name}'...")
        print("-" * 60)
        
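        # Paginate so databases with many tables are fully listed
        paginator = glue_client.get_paginator('get_tables')
        tables = [t for page in paginator.paginate(DatabaseName=database_name)
                  for t in page.get('TableList', [])]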
        
        if not tables:
            print(f"No tables found in '{database_name}'")
            return []
        
        print(f"Found {len(tables)} table(s):\n")
        
        for table in tables:
            print(f"Table: {table['Name']}")
            print(f"  Location: {table.get('StorageDescriptor', {}).get('Location', 'N/A')}")
            print(f"  Columns: {len(table.get('StorageDescriptor', {}).get('Columns', []))}")
            print()
        
        return [t['Name'] for t in tables]
        
    except ClientError as e:
        print(f"ERROR: {e}")
        return []

# Test
list_glue_tables('data_engineering_db')

Listing tables in 'data_engineering_db'...
------------------------------------------------------------
Found 2 table(s):

Table: raw
  Location: s3://real-learn-s3/raw/
  Columns: 9

Table: users
  Location: s3://real-learn-s3/data/users/
  Columns: 3



['raw', 'users']

## PHASE 4: CRAWLERS

Crawlers automatically discover schema and populate the Data Catalog.

### How Crawlers Work

```
1. CONFIGURE           2. RUN                3. CATALOG
+-------------+       +-------------+       +-------------+
| Define:     | ----> | Crawler     | ----> | Creates/    |
| - S3 path   |       | scans files |       | updates     |
| - IAM role  |       | infers      |       | tables in   |
| - Schedule  |       | schema      |       | catalog     |
+-------------+       +-------------+       +-------------+
```
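
The "Schedule" item in step 1 is expressed as a cron string with six fields (minutes, hours, day of month, month, day of week, year), evaluated in UTC. A minimal sketch for building a daily schedule follows; the helper name `daily_schedule` is an assumption for this example.

```python
def daily_schedule(hour_utc, minute=0):
    """Return a Glue-style cron expression that fires once a day at the given UTC time."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour_utc must be 0-23 and minute must be 0-59")
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour_utc} * * ? *)"


schedule = daily_schedule(12, 15)
print(schedule)
```

The resulting string could be passed as the `Schedule` argument of `glue_client.create_crawler`; leaving `Schedule` out keeps the crawler on-demand only.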

### Function 5: Create Crawler

In [8]:
def create_glue_crawler(crawler_name, database_name, s3_path, description=''):
    """
    Create a Glue crawler
    
    Args:
        crawler_name (str): Crawler name
        database_name (str): Target database
        s3_path (str): S3 path to crawl (e.g., 's3://bucket/path/')
    """
    try:
        print(f"Creating crawler '{crawler_name}'...")
        print(f"Target: {database_name}")
        print(f"Path: {s3_path}")
        
        glue_client.create_crawler(
            Name=crawler_name,
            Role=GLUE_ROLE_ARN,
            DatabaseName=database_name,
            Description=description,
            Targets={'S3Targets': [{'Path': s3_path}]},
            SchemaChangePolicy={
                'UpdateBehavior': 'UPDATE_IN_DATABASE',
                'DeleteBehavior': 'LOG'
            }
        )
        
        print(f"SUCCESS: Crawler '{crawler_name}' created")
        return True
        
    except ClientError as e:
        if e.response['Error']['Code'] == 'AlreadyExistsException':
            print(f"Crawler '{crawler_name}' already exists")
            return True
        print(f"ERROR: {e}")
        return False

# Example
create_glue_crawler('my-crawler', 'data_engineering_db', f's3://{BUCKET_NAME}/raw/')

Creating crawler 'my-crawler'...
Target: data_engineering_db
Path: s3://real-learn-s3/raw/
SUCCESS: Crawler 'my-crawler' created


True

### Function 6: Run Crawler

In [16]:
def run_crawler(crawler_name, wait=False):
    """
    Start a Glue crawler
    
    Args:
        crawler_name (str): Crawler name
        wait (bool): Wait for completion
    """
    try:
        print(f"Starting crawler '{crawler_name}'...")
        glue_client.start_crawler(Name=crawler_name)
        print("Crawler started")
        
        if wait:
            print("Waiting for completion...")
            while True:
                response = glue_client.get_crawler(Name=crawler_name)
                state = response['Crawler']['State']
                
                if state == 'READY':
                    # Wait for stats to update, then re-fetch
                    time.sleep(2)
                    response = glue_client.get_crawler(Name=crawler_name)
                    last = response['Crawler'].get('LastCrawl', {})
                    print(f"Completed: {last.get('Status', 'Unknown')}")
                    break
                print(f"  State: {state}")
                time.sleep(10)
        
        return True
        
    except ClientError as e:
        if e.response['Error']['Code'] == 'CrawlerRunningException':
            print("Crawler already running")
            return True
        print(f"ERROR: {e}")
        return False

# Example
run_crawler('my-crawler', wait=True)

Starting crawler 'my-crawler'...
Crawler started
Waiting for completion...
  State: RUNNING
  State: RUNNING
  State: RUNNING
  State: RUNNING
  State: RUNNING
Completed: SUCCEEDED


True
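
A polling loop with no upper bound can hang a notebook if the crawler gets stuck. One option is a generic wait helper with a timeout. This is a sketch: `wait_until` is a hypothetical helper, and the simulated state list stands in for real `glue_client.get_crawler` calls so the example runs without AWS access.

```python
import time


def wait_until(check, timeout_seconds=600, interval_seconds=10,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout_seconds} seconds")
        sleep(interval_seconds)


# Simulated crawler states standing in for glue_client.get_crawler(...)
states = iter(['RUNNING', 'RUNNING', 'READY'])
final = wait_until(lambda: next(states) == 'READY',
                   timeout_seconds=5, interval_seconds=0)
print(f"Crawler ready: {final}")
```

In real use, `check` would be something like `lambda: glue_client.get_crawler(Name='my-crawler')['Crawler']['State'] == 'READY'`.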

## PHASE 5: ETL JOBS

### Job Types Comparison

```
+------------------+-------------------+-------------------+
|                  |    SPARK JOB      |  PYTHON SHELL     |
+------------------+-------------------+-------------------+
| Engine           | Apache Spark      | Python runtime    |
| Best For         | Large datasets    | Small datasets    |
| Min Resources    | 2 DPUs            | 0.0625 DPU        |
| Startup Time     | ~1 minute         | Seconds           |
+------------------+-------------------+-------------------+
```

### Function 7: Create Python Shell Job

In [22]:
def create_python_shell_job(job_name, script_location, description=''):
    """
    Create a Python Shell ETL job
    
    Args:
        job_name (str): Job name
        script_location (str): S3 path to Python script
    """
    try:
        print(f"Creating Python Shell job '{job_name}'...")
        
        glue_client.create_job(
            Name=job_name,
            Description=description,
            Role=GLUE_ROLE_ARN,
            Command={
                'Name': 'pythonshell',
                'ScriptLocation': script_location,
                'PythonVersion': '3.9'
            },
            DefaultArguments={
                '--TempDir': f's3://{BUCKET_NAME}/glue-temp/',
                '--job-language': 'python'
            },
            MaxCapacity=0.0625,
            Timeout=60,
            GlueVersion='3.0'
        )
        
        print(f"SUCCESS: Job '{job_name}' created")
        return True
        
    except ClientError as e:
        if e.response['Error']['Code'] == 'AlreadyExistsException':
            print(f"Job '{job_name}' already exists")
            return True
        print(f"ERROR: {e}")
        return False

In [25]:
# Create job
create_python_shell_job(
    'my-etl-job',
    f's3://{BUCKET_NAME}/scripts/test_job.py',
    description='Test ETL job'
)

Creating Python Shell job 'my-etl-job'...
SUCCESS: Job 'my-etl-job' created


True

### Function 8: Create Spark Job

In [23]:
def create_spark_job(job_name, script_location, description='', 
                     worker_type='G.1X', num_workers=2):
    """
    Create a Spark ETL job
    
    Worker Types:
        - G.025X: 2 vCPU, 4 GB (low-volume streaming jobs only)
        - G.1X: 4 vCPU, 16 GB (standard)
        - G.2X: 8 vCPU, 32 GB (memory-intensive)
    """
    try:
        print(f"Creating Spark job '{job_name}'...")
        print(f"Workers: {num_workers} x {worker_type}")
        
        glue_client.create_job(
            Name=job_name,
            Description=description,
            Role=GLUE_ROLE_ARN,
            Command={
                'Name': 'glueetl',
                'ScriptLocation': script_location,
                'PythonVersion': '3'
            },
            DefaultArguments={
                '--TempDir': f's3://{BUCKET_NAME}/glue-temp/',
                '--enable-metrics': 'true',
                '--enable-continuous-cloudwatch-log': 'true'
            },
            WorkerType=worker_type,
            NumberOfWorkers=num_workers,
            Timeout=120,
            GlueVersion='4.0'
        )
        
        print(f"SUCCESS: Job '{job_name}' created")
        return True
        
    except ClientError as e:
        if e.response['Error']['Code'] == 'AlreadyExistsException':
            print(f"Job '{job_name}' already exists")
            return True
        print(f"ERROR: {e}")
        return False

### Function 9: Run Job

In [26]:
def run_glue_job(job_name, arguments=None, wait=False):
    """
    Start a Glue job
    
    Args:
        job_name (str): Job name
        arguments (dict): Optional job arguments
        wait (bool): Wait for completion
    
    Returns:
        str: Job run ID
    """
    try:
        print(f"Starting job '{job_name}'...")
        
        params = {'JobName': job_name}
        if arguments:
            params['Arguments'] = arguments
        
        response = glue_client.start_job_run(**params)
        run_id = response['JobRunId']
        print(f"Run ID: {run_id}")
        
        if wait:
            print("Waiting for completion...")
            while True:
                status = glue_client.get_job_run(JobName=job_name, RunId=run_id)
                state = status['JobRun']['JobRunState']
                
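                # ERROR and EXPIRED are also terminal states; without them the loop could spin forever
                if state in ['SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT', 'ERROR', 'EXPIRED']:
                    print(f"\nCompleted: {state}")
                    if state in ['FAILED', 'ERROR']:
                        print(f"Error: {status['JobRun'].get('ErrorMessage')}")
                    break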
                print(f"  State: {state}")
                time.sleep(15)
        
        return run_id
        
    except ClientError as e:
        print(f"ERROR: {e}")
        return None

# Example
run_glue_job('my-etl-job', wait=True)

Starting job 'my-etl-job'...
Run ID: jr_94ce13623351cff864d62455474322809fd22ab37296406b95ef67e632f3c1c2
Waiting for completion...
  State: RUNNING
  State: RUNNING
  State: RUNNING

Completed: SUCCEEDED


'jr_94ce13623351cff864d62455474322809fd22ab37296406b95ef67e632f3c1c2'
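
The `arguments` dictionary passed to a job run uses keys prefixed with `--` and string values. A small converter keeps call sites tidy. This is a sketch: `to_glue_arguments` is a hypothetical helper, and the bucket and parameter names are made up.

```python
def to_glue_arguments(params):
    """Prefix each key with '--' and stringify values, as Glue job arguments expect."""
    return {f"--{key.lstrip('-')}": str(value) for key, value in params.items()}


job_args = to_glue_arguments({
    'input_path': 's3://example-bucket/raw/',  # made-up bucket
    'batch_size': 500,
})
print(job_args)
```

These could be passed as `run_glue_job('my-etl-job', arguments=job_args)`. Inside a Glue script, the values are read back without the prefix, for example `getResolvedOptions(sys.argv, ['input_path', 'batch_size'])`.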

### Function 10: List Jobs

In [27]:
def list_glue_jobs():
    """
    List all Glue jobs
    """
    try:
        print("Listing Glue jobs...")        
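        # Paginate so accounts with many jobs are fully listed
        paginator = glue_client.get_paginator('get_jobs')
        jobs = [j for page in paginator.paginate()
                for j in page.get('Jobs', [])]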
        
        if not jobs:
            print("No jobs found")
            return []
        
        print(f"Found {len(jobs)} job(s):\n")
        
        for job in jobs:
            print(f"Job: {job['Name']}")
            print(f"Type: {job['Command']['Name']}")
            print()
        
        return [j['Name'] for j in jobs]
        
    except ClientError as e:
        print(f"ERROR: {e}")
        return []

# Test
list_glue_jobs()

Listing Glue jobs...
Found 2 job(s):

Job: aws-glue-pipeline
Type: glueetl

Job: my-etl-job
Type: pythonshell



['aws-glue-pipeline', 'my-etl-job']

### Sample Spark ETL Script

```python
# sample_etl.py - Upload to S3
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="raw_data"
)

# Transform
filtered = Filter.apply(frame=datasource, f=lambda x: x['status'] == 'active')

# Write as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=filtered,
    connection_type="s3",
    connection_options={"path": "s3://bucket/processed/"},
    format="parquet"
)

job.commit()
```

## PHASE 6: ADVANCED TOPICS

### Job Bookmarks

Enable incremental processing - only process new data:

```
WITHOUT BOOKMARKS              WITH BOOKMARKS

Run 1: [A, B, C]               Run 1: [A, B, C] --> Bookmark: C
Run 2: [A, B, C, D, E]         Run 2: [D, E]    --> Bookmark: E
       (processes all)                (only new!)
```

Enable with: `--job-bookmark-option job-bookmark-enable`
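
The bookmark option is passed like any other job argument. The sketch below builds that argument for the three documented modes; the helper name `bookmark_arguments` is an assumption for this example.

```python
# Short mode names (illustrative) mapped to the values Glue accepts
BOOKMARK_OPTIONS = {
    'enable': 'job-bookmark-enable',
    'disable': 'job-bookmark-disable',
    'pause': 'job-bookmark-pause',
}


def bookmark_arguments(mode='enable'):
    """Return the job argument that sets the bookmark behaviour for a run."""
    if mode not in BOOKMARK_OPTIONS:
        raise ValueError(f"mode must be one of {sorted(BOOKMARK_OPTIONS)}")
    return {'--job-bookmark-option': BOOKMARK_OPTIONS[mode]}


args = bookmark_arguments('enable')
print(args)
```

The dictionary can be merged into `DefaultArguments` when creating a job or passed as `Arguments` for a single run. For bookmarks to take effect, the script also needs to call `job.init(...)` and `job.commit()`, and its sources should be given a `transformation_ctx`.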

---

### Cost Optimization Tips

1. **Use Python Shell for small jobs** - 0.0625 DPU vs 2 DPU minimum
2. **Enable job bookmarks** - Process only new data
3. **Use G.025X workers for low-volume streaming jobs** - 0.25 DPU per worker; this worker type is limited to streaming jobs
4. **Set appropriate timeouts** - Avoid runaway jobs
5. **Partition your data** - Faster crawling and querying
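
For tip 5, a common layout is Hive-style `key=value` folders, which crawlers can pick up as partition columns. The sketch below builds such a prefix for a date; the helper name `partition_prefix` and the bucket name are made up.

```python
from datetime import date


def partition_prefix(base_path, day):
    """Return a Hive-style partition prefix (year=/month=/day=) under base_path."""
    base = base_path.rstrip('/')
    return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


prefix = partition_prefix('s3://example-bucket/processed/events', date(2024, 1, 15))
print(prefix)
```

Writing output under prefixes like this lets query engines skip partitions that a filter on `year`, `month`, or `day` rules out.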

## Cleanup Functions

In [28]:
def delete_glue_database(database_name):
    """Delete a Glue database (its tables are also removed, asynchronously)"""
    try:
        glue_client.delete_database(Name=database_name)
        print(f"Deleted database '{database_name}'")
        return True
    except ClientError as e:
        print(f"ERROR: {e}")
        return False

def delete_glue_table(database_name, table_name):
    """Delete a table"""
    try:
        glue_client.delete_table(DatabaseName=database_name, Name=table_name)
        print(f"Deleted table '{table_name}'")
        return True
    except ClientError as e:
        print(f"ERROR: {e}")
        return False

def delete_glue_crawler(crawler_name):
    """Delete a crawler"""
    try:
        glue_client.delete_crawler(Name=crawler_name)
        print(f"Deleted crawler '{crawler_name}'")
        return True
    except ClientError as e:
        print(f"ERROR: {e}")
        return False

def delete_glue_job(job_name):
    """Delete a job"""
    try:
        glue_client.delete_job(JobName=job_name)
        print(f"Deleted job '{job_name}'")
        return True
    except ClientError as e:
        print(f"ERROR: {e}")
        return False

## Summary

| Phase | Topics |
|-------|--------|
| 1 | Core concepts, architecture, components |
| 2 | Setup, IAM roles, boto3 configuration |
| 3 | Data Catalog: databases, tables, metadata |
| 4 | Crawlers: automatic schema discovery |
| 5 | ETL Jobs: Python Shell and Spark |
| 6 | Advanced: bookmarks, optimization |