# 🧭 Developing a Data Catalog with AWS Glue Crawlers

**Mode:** Console-only • **Region:** `us-east-1`

### 🎯 Objective
Build a Glue **Data Catalog** by running a **Crawler** over CSV data in S3.

### Why this matters
Glue Crawlers automatically discover schema (columns, data types, partitions) and create tables in the **Glue Data Catalog** that services like **Athena**, **Redshift Spectrum**, and **EMR** can query.


## ✅ Prerequisites
- AWS account with access to **S3, IAM, Glue**.
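- Create an S3 bucket in `us-east-1`: **`glue-lab24-crawler-us-east-1`**. Bucket names are globally unique, so if this one is taken, pick another name and substitute it in every path that follows.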
- Upload your files into: `s3://glue-lab24-crawler-us-east-1/sales-data/`
  - `orders_with_header.csv`
  - `customers_with_header.csv`
- You’ll create an IAM role **AWSGlueServiceRoleLab24** below.
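
The lab itself is console-only, but the bucket and upload step can also be scripted. The following is a minimal sketch, assuming `boto3` is installed, AWS credentials are configured, and the two CSV files sit in the current directory. The helper names `s3_key` and `upload_lab_files` are illustrative, not part of any AWS API.

```python
from pathlib import Path

BUCKET = "glue-lab24-crawler-us-east-1"  # replace if the name is already taken
PREFIX = "sales-data/"
FILES = ["orders_with_header.csv", "customers_with_header.csv"]


def s3_key(filename: str, prefix: str = PREFIX) -> str:
    """Return the object key for a local file under the lab prefix."""
    return prefix.rstrip("/") + "/" + Path(filename).name


def upload_lab_files(local_dir: str = ".") -> None:
    """Create the bucket and upload both CSVs (needs boto3 and credentials)."""
    import boto3  # imported lazily so s3_key() works without boto3 installed

    s3 = boto3.client("s3", region_name="us-east-1")
    # In us-east-1, create_bucket is called without a LocationConstraint.
    s3.create_bucket(Bucket=BUCKET)
    for name in FILES:
        s3.upload_file(str(Path(local_dir) / name), BUCKET, s3_key(name))
```

Calling `upload_lab_files()` once should leave both files under `s3://glue-lab24-crawler-us-east-1/sales-data/`, which is the path the crawler scans later.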


## 1) Create IAM role for the crawler
1. AWS Console → **IAM → Roles → Create role**  
2. **Trusted entity**: AWS service → **Glue**  
3. **Permissions**: attach
   - `AWSGlueServiceRole`
   - `AmazonS3FullAccess` *(acceptable for a lab; otherwise use a least‑privilege policy scoped to your bucket path)*  
4. **Role name**: `AWSGlueServiceRoleLab24`
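
For reference, the role created above can be sketched in code. This is an illustration under assumptions, not the lab's required method: it uses `boto3`, swaps `AmazonS3FullAccess` for a read-only inline policy limited to the lab prefix, and the inline policy name `Lab24CrawlerS3Read` is made up for this example.

```python
import json

ROLE_NAME = "AWSGlueServiceRoleLab24"
BUCKET = "glue-lab24-crawler-us-east-1"
PREFIX = "sales-data/"

# Trust policy: allows the Glue service to assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def least_privilege_s3_policy(bucket: str = BUCKET, prefix: str = PREFIX) -> dict:
    """Read-only access to the crawled prefix (alternative to AmazonS3FullAccess)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


def create_crawler_role() -> None:
    """Create the role and attach both policies (needs boto3 and credentials)."""
    import boto3  # lazy import: the policy helpers above need only the stdlib

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=ROLE_NAME,
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    )
    iam.attach_role_policy(
        RoleName=ROLE_NAME,
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    )
    # Hypothetical inline policy name, chosen for this sketch only.
    iam.put_role_policy(
        RoleName=ROLE_NAME,
        PolicyName="Lab24CrawlerS3Read",
        PolicyDocument=json.dumps(least_privilege_s3_policy()),
    )
```

Printing `json.dumps(least_privilege_s3_policy(), indent=2)` gives a document that can also be pasted into the console's JSON policy editor if you prefer to stay in the console.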


## 2) Create a new Crawler
1. AWS Console → **AWS Glue → Crawlers → Create crawler**
2. **Name**: `sales-data-crawler24`
3. **Data source**: **S3**
4. **S3 path**: `s3://glue-lab24-crawler-us-east-1/sales-data/`
5. **IAM role**: *Choose existing* → `AWSGlueServiceRoleLab24`
6. **Target** → **Create a database**: `sales_data_db24`
7. **Schedule**: *On demand*
8. **Table name changes**: keep defaults
9. **Create crawler**
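
The console choices above map fairly directly onto the Glue API. As a rough sketch, assuming `boto3` and the names used in this lab, the same crawler could be defined like this. The function names `crawler_params` and `create_lab_crawler` are illustrative.

```python
def crawler_params() -> dict:
    """Keyword arguments for glue.create_crawler that mirror the console steps."""
    return {
        "Name": "sales-data-crawler24",
        "Role": "AWSGlueServiceRoleLab24",
        "DatabaseName": "sales_data_db24",
        "Targets": {
            "S3Targets": [
                {"Path": "s3://glue-lab24-crawler-us-east-1/sales-data/"}
            ]
        },
        # No "Schedule" key: the crawler then runs on demand only.
    }


def create_lab_crawler() -> None:
    """Create the target database and the crawler (needs boto3 and credentials)."""
    import boto3  # lazy import so crawler_params() can be inspected offline

    glue = boto3.client("glue", region_name="us-east-1")
    params = crawler_params()
    # Mirrors the console's "Create a database" step.
    glue.create_database(DatabaseInput={"Name": params["DatabaseName"]})
    glue.create_crawler(**params)
```

Keeping the parameters in a plain dictionary makes it easy to compare them against what the console shows on the crawler's detail page.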


## 3) Run the crawler and verify tables
1. Select **sales-data-crawler24** and choose **Run**. Wait until the run status shows **Completed**.
2. Go to **Glue → Data Catalog → Databases → sales_data_db24 → Tables**.
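3. You should see one table inferred per CSV (example: `orders_with_header`, `customers_with_header`). Table names are derived from the file names, so the exact names may differ, for instance by including a `_csv` suffix.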
4. Open a table → **Schema** to review columns and data types.
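
The run-and-verify steps can also be checked programmatically. The sketch below assumes `boto3` and the lab's names. It starts the crawler, polls until it returns to the `READY` state, then lists each table with its inferred columns. The helpers `is_finished`, `summarize_tables`, and `run_and_list_tables` are illustrative names.

```python
import time


def is_finished(crawler: dict) -> bool:
    """A run is over once the crawler is back in the READY state."""
    return crawler["State"] == "READY"


def summarize_tables(table_list: list) -> dict:
    """Map each table name to its (column name, data type) pairs."""
    return {
        t["Name"]: [(c["Name"], c["Type"]) for c in t["StorageDescriptor"]["Columns"]]
        for t in table_list
    }


def run_and_list_tables(
    name: str = "sales-data-crawler24",
    database: str = "sales_data_db24",
    poll_seconds: int = 15,
) -> dict:
    """Run the crawler and return the inferred schemas (needs boto3 and credentials)."""
    import boto3  # lazy import: the helpers above are pure Python

    glue = boto3.client("glue", region_name="us-east-1")
    glue.start_crawler(Name=name)
    while True:
        time.sleep(poll_seconds)
        crawler = glue.get_crawler(Name=name)["Crawler"]
        if is_finished(crawler):
            break
    # LastCrawl.Status is one of SUCCEEDED, CANCELLED, or FAILED.
    status = crawler.get("LastCrawl", {}).get("Status")
    if status != "SUCCEEDED":
        raise RuntimeError(f"Crawler run ended with status {status}")
    return summarize_tables(glue.get_tables(DatabaseName=database)["TableList"])
```

The returned dictionary holds the same information as the **Schema** tab, which makes it convenient to spot a column that was inferred as `string` when a numeric type was expected.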

### Notes
- If headers aren’t detected or delimiters differ, you’ll fix that in **Lab 25 (Custom Classifier)**.


## 🧹 Cleanup
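- Delete the crawler `sales-data-crawler24` if it is no longer needed (stop it first if a run is in progress).
- Delete the database `sales_data_db24` and its tables from the Data Catalog.
- Delete the IAM role `AWSGlueServiceRoleLab24` and the S3 bucket (empty the bucket first).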
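
If you would rather script the cleanup, the following is one possible sketch. It assumes `boto3`, the resource names from this lab, an unversioned bucket, and a crawler that is not currently running. The `cleanup` function and the `LAB` dictionary are illustrative, and the clients are passed in so the function can be exercised with stand-ins.

```python
LAB = {
    "crawler": "sales-data-crawler24",
    "database": "sales_data_db24",
    "role": "AWSGlueServiceRoleLab24",
    "bucket": "glue-lab24-crawler-us-east-1",
}


def cleanup(glue, iam, bucket) -> None:
    """Delete the lab resources.

    glue, iam: boto3 clients; bucket: a boto3 S3 Bucket resource.
    """
    glue.delete_crawler(Name=LAB["crawler"])
    # Deleting the database also removes the tables it contains.
    glue.delete_database(Name=LAB["database"])

    # A role can only be deleted once its policies are detached or removed.
    attached = iam.list_attached_role_policies(RoleName=LAB["role"])
    for policy in attached["AttachedPolicies"]:
        iam.detach_role_policy(RoleName=LAB["role"], PolicyArn=policy["PolicyArn"])
    inline = iam.list_role_policies(RoleName=LAB["role"])
    for policy_name in inline["PolicyNames"]:
        iam.delete_role_policy(RoleName=LAB["role"], PolicyName=policy_name)
    iam.delete_role(RoleName=LAB["role"])

    # A bucket must be empty before it can be deleted.
    bucket.objects.all().delete()
    bucket.delete()


# Example call (uncomment to run against your account):
# import boto3
# cleanup(
#     boto3.client("glue", region_name="us-east-1"),
#     boto3.client("iam"),
#     boto3.resource("s3").Bucket(LAB["bucket"]),
# )
```

The call is left commented out on purpose, since it permanently deletes the bucket contents and the role.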


## 🧭 Reflection
- **Crawler** = automated schema discovery.  
- **Data Catalog** = central metadata store for your lakehouse. 